SwePub
Sök i SwePub databas

  Utökad sökning

Träfflista för sökning "WFRF:(Lu Zhonghai) "

Sökning: WFRF:(Lu Zhonghai)

  • Resultat 1-50 av 294
Sortera/gruppera träfflistan
   
NumreringReferensOmslagsbildHitta
1.
  • Chen, X., et al. (författare)
  • Achieving memory access equalization via round-trip routing latency prediction in 3D many-core NoCs
  • 2015
  • Ingår i: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI. - : IEEE. ; , s. 398-403
  • Konferensbidrag (refereegranskat)abstract
    • 3D many-core NoCs are emerging architectures for future high-performance single chips due to its integration of many processor cores and memories by stacking multiple layers. In such architecture, because processor cores and memories reside in different locations (center, corner, edge, etc.), memory accesses behave differently due to their different communication distances, and the performance (latency) gap of different memory accesses becomes larger as the network size is scaled up. This phenomenon may lead to very high latencies suffered from by some memory accesses, thus degrading the system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this should be done with care since shortening the latency of one memory access can worsen the latency of another as a result of shared network resources. Therefore, the goal should focus on narrowing the latency difference of memory accesses. In the paper, we address the goal by proposing to prioritize the memory access packets based on predicting the round-trip routing latencies of memory accesses. The communication distance and the number of the occupied items in the buffers in the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip routing latency is used as the base to arbitrate the memory access packets so that the memory access with potential high latency can be transferred as early and fast as possible, thus equalizing the memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates prove that our approach can achieve the goal of memory access equalization and outperforms the classic round-robin arbitration in terms of maximum latency, average latency, and LSD1. In the experiments, the maximum improvement of the maximum latency, the average latency and the LSD are 80%, 14%, and 45% respectively.
  •  
2.
  • Liu, Ming, et al. (författare)
  • ATCA-based Computation Platform for Data Acquisition and Triggering in Particle Physics Experiments
  • 2008
  • Ingår i: 2008 INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE AND LOGIC APPLICATIONS, VOLS 1 AND 2. ; , s. 287-292
  • Konferensbidrag (refereegranskat)abstract
    • An ATCA-based computation platform for data acquisition and trigger applications in nuclear and particle physics experiments has been developed. Each Compute Node (CN) which appears as a Field Replaceable Unit (FRU) in an ATCA shelf, features 5 Xilinx Virtex-4 FX60 FPGAs and up to 10 GBytes DDR2 memory. Connectivity is provided with 8 optical links and 5 Gigabit Ethernet ports, which are mounted on each board to receive data from detectors and forward results to outer shelves or PC farms with attached mass storage. Fast point-to-point on-board interconnections between FPGAs as well as the full-mesh shelf backplane provide flexibility and high bandwidth to partition algorithms and correlate results among them. The system represents a highly reconfigurable and scalable solution for multiple applications.
  •  
3.
  • Liu, Ming, et al. (författare)
  • Trigger algorithm development on FPGA-based Compute Nodes
  • 2009
  • Ingår i: 2009 16th IEEE-NPSS Real Time Conference. - New York : IEEE. - 9781424457960 ; , s. 478-484
  • Konferensbidrag (refereegranskat)abstract
    • Based on the ATCA computation architecture and Compute Nodes (CN), investigation and implementation work has been being executed for HADES and PANDA trigger algorithms. We present our designs for HADES track reconstruction processing, Cherenkov ring recognition, Time-Of-Flight processing, electromagnetic shower recognition.. and the PANDA straw tube tracking algorithm. They will appear as co-processors in the uniform system design to undertake the detector-specific computing. The algorithm principles will be explained and hardware designs are described in the paper. The current progress reveals the feasibility to implement these algorithms on FPGAs. Also experimental results demonstrate the performance speedup when compared to alternative software solutions, as well as the potential capability of high-speed parallel/pipelined processing in Data Acquisition and Trigger systems.
  •  
4.
  • Liu, Weihua, et al. (författare)
  • Characterizing the Reliability and Threshold Voltage Shifting of 3D Charge Trap NAND Flash
  • 2019
  • Ingår i: 2019 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE). - : IEEE. - 9783981926323 ; , s. 312-315
  • Konferensbidrag (refereegranskat)abstract
    • 3D charge trap (CT) triple-level cell (TLC) NAND flash gradually becomes a mainstream storage component due to high storage capacity and performance, but introducing a concern about reliability. Fault tolerance and data management schemes are capable of improving reliability. Designing a more efficient solution, however, needs to understand the reliability characteristics of 3D CT TLC NAND flash. To facilitate such understanding, by exploiting a real-world testing platform, we investigate the reliability characteristics including the raw bit error rate (RBER) and the threshold voltage (Vth) shifting features after suffering from variable disturbances. We give analyses of why these characteristics exist in 3D CT TLC NAND flash. We hope these observations can guide the designers to propose high efficient solutions to the reliability problem.
  •  
5.
  • Wang, Qiang, et al. (författare)
  • Hardware/Software Co-design of an ATCA-based Computation Platform for Data Acquisition and Triggering
  • 2009
  • Ingår i: 16th IEEE NPSS Real Time Conference. - 9781424457960 ; , s. 485-489
  • Konferensbidrag (refereegranskat)abstract
    • An ATCA-based computation platform for data acquisition and trigger(TDAQ) applications has been developed for multiple future projects such its PANDA. HADES, and BESIII. Each Compute Node (CN) appears as one (if the fourteen Field Replaceable Units (FRU) in an ATCA shelf, which in total features a high performance of 1890 Clips inter-FPGA on-board channels, 1456 Gbps inter-board backplane connections, 728 Gbps full-duplex optical links, 70 Gbps Ethernet. 140 GBytes DDR2 SDRAM. and all computing resources of 70 Xilinx Virtex-4 FX60 FPGAs. Corresponding to (the system architecture, a hardware/software co-design approach is proposed to ease and accelerate the development for different experiments. In the uniform system design. application-specific computation is to be implemented as customized hardware co-processors, while the embedded PowerPC processor takes charge of flexible slow controls and transmission protocol processing.
  •  
6.
  • Anagnostopoulos, Iraklis, et al. (författare)
  • Custom Microcoded Dynamic Memory Management for Distributed On-Chip Memory Organizations
  • 2011
  • Ingår i: IEEE Embedded Systems Letters. - 1943-0663. ; 3:2, s. 66-69
  • Tidskriftsartikel (refereegranskat)abstract
    • Multiprocessor system-on-chip (MPSoCs) have attracted significant attention since they are recognized as a scalable paradigm to interconnect and organize a high number of cores. Current multicore embedded systems exhibit increased levels of dynamicbehavior, leading to unexpected memory footprint variations unknown at design time.Dynamic memory management (DMM) is a promising solution for such types of dynamicsystems. Although some efficient dynamic memory managers have been proposed for conventional bus-based MPSoC platforms, there are no DMM solutions regarding the constraints and the opportunities delivered by the physical distribution of multiple memorynodes of the platform. In this work, we address the problem of providing customizedmicrocoded DMM on MPSoC platforms with distributed memory organization. Customization is enabled at application-and platform-level. Results show that customizedmicrocoded DMM can serve approximately 7× more allocation requests compared to puredistributed memory platforms and perform 25% faster than the corresponding high-level implementation in C language. 
  •  
7.
  • Badawi, Mohammad, 1981- (författare)
  • Adaptive Coarse-grain Reconfigurable Protocol Processing Architecture
  • 2016
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Digital signal processors and their variants have provided significant benefit to efficient implementation of Physical Layer (PHY) of Open Systems Interconnection (OSI) model’s seven-layer protocol processing stack compared to the general purpose processors. Protocol processors promise to provide a similar advantage for implementing higher layers in the (OSI)'s seven-layer model. This thesis addresses the problem of designing customizable coarse-grain reconfigurable protocol processing fabrics as a solution to achieving high performance and computational efficiency. A key requirement that this thesis addresses is the ability to not only adapt to varying applications and standards, and different modes in each standard but also to time varying load and performance demands while maintaining quality of service.This thesis presents a tile-based multicore protocol processing architecture that can be customized at design time to meet the requirements of the target application. The architecture can then be reconfigured at boot time and tuned to suit the desired use-case. This architecture includes a packet-oriented memory system that has deterministic access time and access energy costs, and hence can be accurately dimensioned to fulfill the requirements of the desired use-case. Moreover, to maintain quality of service as predicted, while minimizing the use of energy and resources, this architecture encompasses an elastic management scheme that controls run-time configuration to deploy processing resources based on use-case and traffic demands.To evaluate the architecture presented in this thesis, different case studies were conducted while quantitative and qualitative metrics were used for assessment. Energy-delay product, energy efficiency, area efficiency and throughput show the improvements that were achieved using the processing cores and the memory of the presented architecture, compared with other solutions. Furthermore, the results show the reduction in latency and power consumption required to evaluate controlling states when using the elastic management scheme. The elasticity of the scheme also resulted in reducing the total area required for the controllers that serve multiple processing cores in comparison with other designs. Finally, the results validate the ability of the presented architecture to support quality of service without misutilizing available energy during a real-life case study of a multi-participant Voice Over Internet Protocol (VOIP) call.
  •  
8.
  • Badawi, Mohammad, et al. (författare)
  • Customizable Coarse-grained Energy-efficient Reconfigurable Packet Processing Architecture
  • 2014
  • Ingår i: Proceedings Of The 2014 IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP). - : IEEE. ; , s. 30-35
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we present a highly customizable and rapidly reconfigurable multi-core packet processing architecture that provides energy and area efficiency while retaining flexibility. Presented architecture with its agile reconfigurability permits time-critical adaptability where resources can be re-clustered at run time in few cycles, hence, maintaining efficiency if requirements of the use-case change. We elaborate the flexibility and adaptability of our architecture and we report its evaluation results. For evaluation, we performed the widely-used UDP/IP and we compared our proposed architecture to low-power 32-bit general purpose processors, a custom ASIC implementation and a programmable protocol processor. Compared to GPP-based solutions, our architecture is 20-34 times more energy efficient while providing 2.4-4.1 times higher throughput. While retaining the programmability, the proposed solution achieved 78% of the energy efficiency of hardwired ASIC implementation. Compared to a programmable protocol processor, our solution has 2.6 times more throughput and requires only a third of the gate count. lastly, we quantified the worst-case time and average-case time required for time-critical adaptability when reconfiguration occurs during a real-life Voice-Over IP traffic.
  •  
9.
  • Badawi, Mohammad, et al. (författare)
  • Elastic Management and QoS Provisioning Scheme for Adaptable Multi-core Protocol Processing Architecture
  • 2016
  • Ingår i: 19TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD 2016). - : IEEE. - 9781509028160 ; , s. 575-583
  • Konferensbidrag (refereegranskat)abstract
    • Adaptable protocol processing architectures can offer quality-of-service (QoS) while improving energy efficiency and resource utilization. However, a key condition for adaptable architectures to support QoS is that, the latency required for processor adaptation does not result in violating packet processing delay bound. Moreover, adaptation latency must not cause packets to accumulate until memory becomes full and packets are dropped. In this paper, we present an elastic management scheme for agile adaptable multi-core protocol processing architecture to facilitate processor adaptation when QoS has to be maintained. The proposed management scheme encompasses a set of reconfigurable finite state machines (FSMs) and each is dimensioned to associate single processing element (PE). During processor adaptation, the needed FSMs can rapidly be clustered to provide the control needed for the newly adapted structure. We use a real-life application to demonstrate how our proposed management scheme supports maintaining QoS during processor adaptation. We also quantify the time needed for processor adaptation as well as the reduction in energy, latency and area achieved when using our scheme.
  •  
10.
  • Badawi, Mohammad, et al. (författare)
  • Quality-of-service-aware adaptation scheme for multi-core protocol processing architecture
  • 2017
  • Ingår i: Microprocessors and microsystems. - : Elsevier. - 0141-9331 .- 1872-9436. ; 54, s. 47-59
  • Tidskriftsartikel (refereegranskat)abstract
    • Employing adaptable protocol processing architectures has shown a high potential in provisioning Quality-of-Service (QoS) while retaining efficient use of available energy budget. Nevertheless, successful QoS provisioning using adaptable protocol processing architectures requires adaption to be agile and to have low latency. That is, a long adaptation latency might lead to violating desired packet processing latency, desired throughput or loss of packets if the memory fails to accommodate packet accumulation. This paper presents an elastic management scheme to permit agile and QoS-aware adaptation of processing elements (PEs) within the protocol processing architecture, such that desired QoS is maintained. Moreover, our proposed scheme has the potential to reduce energy consumption since it employs the PEs upon demand. We quantify the latency required for PEs adaptation, the reduction in energy and the reduction in area that can be achieved using our scheme. We also consider two different real-life use cases to demonstrate the effectiveness of our proposed management scheme in maintaining QoS while conserving available energy.
  •  
11.
  • Badawi, Mohammad, et al. (författare)
  • Service-Guaranteed Multi-Port PacketMemory for Parallel Protocol Processing Architecture
  • 2016
  • Ingår i: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016. - : Institute of Electrical and Electronics Engineers (IEEE). - 9781467387750 ; , s. 408-412
  • Konferensbidrag (refereegranskat)abstract
    • Parallel processing architectures have been increasingly utilized due to their potential for improving performance and energy efficiency. Unfortunately, the anticipated improvement often suffers from a limitation caused by memory access latency and latency variation, which consequently impact Quality of Service (QoS). This paper presents a service-guaranteed multi-port packet memory system to boost parallelism in protocol processing architectures. In this proposed memory system, all arriving packets are guaranteed a memory space, such that, a packet memory space can be allocated in a bounded number of cycles and each of its locations is accessible in a single cycle. We consider a real-time Voice Over Internet Protocol (VOIP) call as a case-study to evaluate our service-guaranteed memory system.
  •  
12.
  • Becker, Matthias, 1986-, et al. (författare)
  • An adaptive resource provisioning scheme for industrial SDN networks
  • 2019
  • Ingår i: IEEE International Conference on Industrial Informatics (INDIN). - : Institute of Electrical and Electronics Engineers Inc.. - 9781728129273 ; , s. 877-880
  • Konferensbidrag (refereegranskat)abstract
    • Many industrial domains face the challenge of ever growing networks, driven for example by Internet-of-Things and Industry 4.0. This typically comes together with increased network configuration and management efforts. In addition to the increasing network size, these domains typically are subject to adaptive load situations that pose an additional challenge on the network infrastructure.Software defined networking (SDN) is a promising networking paradigm that reduces configuration complexity and management effort in Ethernet networks. In this work, we investigate SDN in context of adaptive scenarios with QoS constraints. Our approach applies monitoring of several thresholds which automatically trigger redistribution of resources via the central SDN controller. This setup leads to an agile system that can dynamically react to load changes while the infrastructure is not overprovisioned. The approach is implemented in a low-level simulation environment where we demonstrate the benefits of the approach using a case study.
  •  
13.
  • Becker, Matthias, 1986-, et al. (författare)
  • Towards QoS-Aware Service-Oriented Communication in E/E Automotive Architectures
  • 2018
  • Ingår i: Proceedings of the 44th Annual Conference of the IEEE Industrial Electronics Society (IECON). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 4096-4101
  • Konferensbidrag (refereegranskat)abstract
    • With the raise of increasingly advanced driving assistance systems in modern cars, execution platforms that build on the principle of service-oriented architectures are being proposed. Alongside, service oriented communication is used to provide the required adaptive communication infrastructure on top of automotive Ethernet networks. A middleware is proposed that enables QoS aware service-oriented communication between software components, where the prescribed behavior of each software component is defined by Assume/Guarantee (A-G) contracts. To enable the use of COTS components, that are often not sufficiently verified for the use in automotive systems, the middleware monitors the communication behavior of components and verifies it against the components A/G contract. A violation of the allowed communication behavior then triggers adaption processes in the system while the impact on other communication is minimized. The applicability of the approach is demonstrated by a case study that utilizes a prototype implementation of the proposed approach.
  •  
14.
  • Candaele, Bernard, et al. (författare)
  • Mapping Optimisation for Scalable multi-core ARchiTecture : The MOSART approach
  • 2010
  • Ingår i: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010. - 9780769540764 ; , s. 518-523
  • Konferensbidrag (refereegranskat)abstract
    • The project will address two main challenges of prevailing architectures: 1) The global Interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; 2) The difficulties in programming heterogeneous, multi-core platforms, in particular in dynamically managing data structures in distributed memory. MOSART aims to overcome these through a multi-core architecture with distributed memory organisation, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimised and customised together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: A) Providing platform support for management of abstract data structures Including middleware services and a run-time data manager for NoC based communication infrastructure; 2) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.
  •  
15.
  • Candaele, Bernard, et al. (författare)
  • The MOSART Mapping Optimization for multi-core Architectures
  • 2011
  • Ingår i: VLSI 2010 Annual Symposium. - Dordrecht : Springer Publishing Company. ; , s. 181-195
  • Konferensbidrag (refereegranskat)abstract
    • MOSART project addresses two main challenges of prevailing architectures: (i) Theglobal interconnect and memory bottleneck due to a single, globally shared memorywith high access times and power consumption; (ii) The difficulties in programmingheterogeneous, multi-core platforms MOSART aims to overcome these through amulti-core architecture with distributed memory organization, a Network-on-Chip(NoC) communication backbone and configurable processing cores that are scaled,optimized and customized together to achieve diverse energy, performance, cost andsize requirements of different classes of applications. MOSART achieves this by:(i) Providing platform support for management of abstract data structures includingmiddleware services and a run-time data manager for NoC based communicationinfrastructure; (ii) Developing tool support for parallelizing and mapping applicationson the multi-core target platform and customizing the processing cores for theapplication.
  •  
16.
  • Chen, Dejiu, et al. (författare)
  • A methodological framework for model-based self-management of services and components in dependable cyber-physical systems
  • 2017
  • Ingår i: 12th International Conference on Dependability and Complex Systems, DepCoS-RELCOMEX 2017. - Cham : Springer. - 9783319594149 ; , s. 97-105
  • Konferensbidrag (refereegranskat)abstract
    • Modern automotive vehicles featuring ADAS (Advanced Driving Assistant Systems) and AD (Autonomous Driving) represent one category of dependable CPS (Cyber-Physical Systems). For such systems, the adaptation of generic purpose COTS (Commercial-Off-The-Shelf) services and components has been advocated in the industry as a necessary means for shortening the innovation loops and enabling efficient product evolution. This will however not be a trivial task due to the system safety- and time-criticality. This calls on one hand for formal specification of systems, and on the other hand for a systematic approach to module design, supervision and adaptions. Accordingly, we propose in this paper a novel method that emphasizes an integration of system models, formal contracts, and embedded services for effective self-management of COTS. The key modeling technologies include the EAST-ADL for formal system description and the A-G contract theory for module specification.
  •  
17.
  • Chen, DeJiu, et al. (författare)
  • IMBSA 2017: Model-Based Safety and Assessment
  • 2017
  • Ingår i: Model-Based Safety and Assessment - 5th International Symposium, Trento, Italy, September 11–13, 2017. - Cham : Springer. - 9783319641188 ; , s. 227-240
  • Konferensbidrag (refereegranskat)abstract
    • Modern automotive vehicles represent one category of CPS (Cyber-Physical Systems) that are inherently time- and safety-critical. To justify the actions for quality-of-service adaptation and safety assurance, it is fundamental to perceive the uncertainties of system components in operation, which are caused by emergent properties, design or operation anomalies. From an industrial point of view, a further challenge is related to the usages of generic purpose COTS (Commercial-Off-The-Shelf) components, which are separately developed and evolved, often not sufficiently verified and validated for specific automotive contexts. While introducing additional uncertainties in regard to the overall system performance and safety, the adoption of COTS components constitutes a necessary means for effective product evolution and innovation. Accordingly, we propose in this paper a novel approach that aims to enable advanced operation monitoring and self-assessment in regard to operational uncertainties and thereby automated performance and safety awareness. The emphasis is on the integration of several modeling technologies, including the domain-specific modeling framework EAST-ADL, the A-G contract theory and Hidden Markov Model (HMM). In particular, we also present some initial concepts in regard to the usage performance and safety awareness for quality-of-service adaptation and dynamic risk mitigation.
  •  
18.
  • Chen, Hui, et al. (författare)
  • A CORDIC-Based Architecture with Adjustable Precision and Flexible Scalability to Implement Sigmoid and Tanh Functions
  • 2020
  • Ingår i: IEEE International Symposium on Circuits and Systems, ISCAS 2020. - : IEEE.
  • Konferensbidrag (refereegranskat)abstract
    • In the artificial neural networks, tanh (hyperbolic tangent) and sigmoid functions are widely used as activation functions. Past methods to compute them may have shortcomings such as low precision or inflexible architecture that is difficult to expand, so we propose a CORDIC-based architecture to implement sigmoid and tanh functions, which has adjustable precision and flexible scalability. It just needs shift-add-or-subtract operations to compute high-accuracy results and is easy to expand the input range through scaling the negative iterations of CORDIC without changing the original architecture. We adopt the control variable method to explore the accuracy distribution through software simulation. A specific case (ARCH:(1, 15, 18), RMSE: 10(-6)) is designed and synthesized under the TSMC 40nm CMOS technology, the report shows that it has the area of 36512.78 mu m(2) and power of 12.35mW at the frequency of 1GHz. The maximum work frequency can reach 1.5GHz, which is better than the state-of-the-art methods.
  •  
19.
  • Chen, Hui, et al. (författare)
  • A General Methodology and Architecture for Arbitrary Complex Number Nth Root Computation
  • 2021
  • Ingår i: 2021 SCAS 2021/IEEE International Symposium on Circuits and Systems. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Konferensbidrag (refereegranskat)abstract
    • As the existing complex number Nth root computation methods are relatively discrete, we propose a general method and architecture based on coordinate rotation digital computer (CORDIC) to compute arbitrary complex number Nth root for the first time. Our method performs the tasks of computing complex modulus, complex phase angle, real Nth root, sine function and cosine function, which can be implemented by circular CORDIC, linear CORDIC and hyperbolic CORDIC. Based on these CORDICs, our proposed architecture can not only improve the hardware efficiency just through shift-add operations, but also flexibly adjust the precision and the input range of complex number Nth root. To prove its feasibility, we conduct a software simulation and implement an example circuit in hardware. Under the TSMC 28nm CMOS technology, we synthesize it and get the report that it has the area of 6561 mu m(2) and the power of 3.95mW at the frequency of 1.5GHz.
  •  
20.
  • Chen, Hui, et al. (författare)
  • An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions
  • 2020
  • Ingår i: Electronics. - : MDPI. - 2079-9292. ; 9:10
  • Tidskriftsartikel (refereegranskat)abstract
    • The efficient and precise hardware implementations of tanh and sigmoid functions play an important role in various neural network algorithms. Different applications have different requirements for accuracy. However, it is difficult for traditional methods to achieve adjustable precision. Therefore, we propose an efficient-hardware, adjustable-precision and high-speed architecture to implement them for the first time. Firstly, we present two methods to implement sigmoid and tanh functions. One is based on the rotation mode of hyperbolic CORDIC and the vector mode of linear CORDIC (called RHC-VLC), another is based on the carry-save method and the vector mode of linear CORDIC (called CSM-VLC). We validate the two methods by MATLAB and RTL implementations. Synthesized under the TSMC 40 nm CMOS technology, we find that a special case AR divide VR(3,0), based on RHC-VLC method, has the area of 4290.98 mu m2 and the power of 1.69 mW at the frequency of 1.5 GHz. However, under the same frequency, AR divide VC(3) (a special case based on CSM-VLC method) costs 3196.36 mu m2 area and 1.38 mW power. They are both superior to existing methods for implementing such an architecture with adjustable precision.
  •  
21.
  • Chen, H., et al. (författare)
  • Huicore : A Generalized Hardware Accelerator for Complicated Functions
  • 2022
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 69:6, s. 2463-2476
  • Tidskriftsartikel (refereegranskat)abstract
    • Emerging advanced System-on-Chip (SoC) designs contain more and more complicated functions to be accelerated. This presents a challenge to conventional design approaches which use different hardware architectures or separate hardware accelerators to implement the various functions. To tackle this challenge, for the first time, we propose a generalized hardware accelerator called 'Huicore' to speed up diverse functions on the same substrate. Through the analysis and transformation of mathematical characteristics, we reveal the commonality of many complicated functions using the CORDIC algorithm. Then we explore a reconfigurable architecture to implement them. The proposed reconfigurable accelerator can not only accelerate the implementation of many complicated functions, but also has small area, low power consumption and high precision. It is very suitable for integration in a SoC system to accelerate the implementation of various applications.
  •  
22.
  • Chen, Hui, et al. (författare)
  • Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation
  • 2020
  • Ingår i: IEEE Transactions on Circuits and Systems - II - Express Briefs. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-7747 .- 1558-3791. ; 67:11, s. 2652-2656
  • Tidskriftsartikel (refereegranskat)abstract
    • We present a CORDIC (Coordinate Rotation Digital Computer)-based method to compute the logarithm function with base 2 and validate this method by software simulation and hardware implementation. Technically, we overcome the limitation of traditional hyperbolic CORDIC and transform it based on the idea of generalized hyperbolic CORDIC so that it can be used to compute $log_{2}x\;(x\;\epsilon \;[1,2))$ . The proposed method requires only simple shift-and-add operations and has a great tradeoff between precision (or speed) and area. In MATLAB, we provide different precisions corresponding to the iterations of the transformed CORDIC for user needs. Using a pipelined structure and setting the number of iterations to be 16 (the average relative error is $2.09\times 10<^>{-6}$ ), we implement an example hardware circuit. Synthesized under the SMIC 65nm CMOS technology, the circuit has an area of 24100 $\mu m<^>{2}$ and computation time of 11.1 ns, which can save 31.04x0025; area and improve 6.92x0025; computation speed averagely compared with existing methods.
  •  
23.
  • Chen, Hui, et al. (författare)
  • Low-Complexity High-Precision Method and Architecture for Computing the Logarithm of Complex Numbers
  • 2021
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 68:8, s. 3293-3304
  • Tidskriftsartikel (refereegranskat)abstract
    • This paper proposes a low-complexity method and architecture to compute the logarithm of complex numbers based on coordinate rotation digital computer (CORDIC). Our method takes advantage of the vector mode of circular CORDIC and hyperbolic CORDIC, which only needs shift-add operations in its hardware implementation. Our architecture has lower design complexity and higher performance compared with conventional architectures. Through software simulation, we show that this method can achieve high precision for logarithm computation, reaching the relative error of 10(-7). Finally, we design and implement an example circuit under TSMC 28nm CMOS technology. According to the synthesis report, our architecture has smaller area, lower power consumption, higher precision and wider operation range compared with the alternative architectures.
  •  
24.
  • Chen, Hui, et al. (författare)
  • Symmetric-Mapping LUT-Based Method and Architecture for Computing X-Y-Like Functions
  • 2021
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 1549-8328 .- 1558-0806. ; 68:3, s. 1231-1244
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose a new method and hardware architecture to compute the functions expressed as XY ( X and Y are arbitrary floating-point numbers), which can support arbitrary Nth root, exponential and power operations. Because of the complexity of direct computation, we usually convert it to logarithm, multiplication, and antilogarithm operations. Traditional approaches suffer from long latency, large area and high power consumption. To solve this problem, we propose a symmetric-mapping lookup table (SM-LUT) to be capable of computing log(2) x (x is an element of [1, 2]) and 2 x (x is an element of [0, 1]) simultaneously. It lays the foundation for computing XY. To further improve hardware performance of our architecture, we propose a multi-region address searcher to speed up the calculation of SM-LUT. In addition, we use an optimized Vedic multiplier to shorten the critical path and improve the efficiency of multiplication, which is included in computing X-Y. Under the TSMC 40nm CMOS technology, we design and synthesize a reference circuit to compute X-Y with a maximum relative error of 10(-3). The report shows that the reference circuit achieves the area of 14338.50 mu m(2) and the power consumption of 4.59 mW at the frequency of 1 GHz. In comparison with the state-of-the-art work under the same input range and similar precision, it saves 78.57% area and 80.42% power consumption for (N)root R computation and 82.89% area and 81.89% power consumption for R-N computation averagely. On top of that, our architecture reduces the computation latency by 62.77% averagely and has one more order of magnitude of energy efficiency than others.
  •  
25.
  • Chen, Qinyu, et al. (författare)
  • An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective
  • 2020
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 28:6, s. 1540-1544
  • Tidskriftsartikel (refereegranskat)abstract
    • Convolutional neural networks (CNNs) have emerged as one of the most popular ways applied in many fields. These networks deliver better performance when going deeper and larger. However, the complicated computation and huge storage impede hardware implementation. To address the problem, quantized networks are proposed. Besides, various convolutional structures are designed to meet the requirements of different applications. For example, compared with the traditional convolutions (CONVs) for image classification, CONVs for image generation are usually composed of traditional CONVs, dilated CONVs, and transposed CONVs, leading to a difficult hardware mapping problem. In this brief, we translate the difficult mapping problem into the sparsity problem and propose an efficient hardware architecture for sparse binary and ternary CNNs by exploiting the sparsity and low bit-width characteristics. To this end, we propose an ineffectual data removing (IDR) mechanism to remove both the regular and irregular sparsity based on dual-channel processing elements (PEs). Besides, a flexible layered load balance (LLB) mechanism is introduced to alleviate the load imbalance. The accelerator is implemented with 65-nm technology with a core size of 2.56 mm(2). It can achieve 3.72-TOPS/W energy efficiency at 50.1 mW, which makes it a promising design for embedded devices.
  •  
26.
  • Chen, Qinyu, et al. (författare)
  • An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
  • 2019
  • Ingår i: Electronics. - : MDPI. - 2079-9292. ; 8:4
  • Tidskriftsartikel (refereegranskat)abstract
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on the embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs are proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with the low power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to well support the streaming architecture. The accelerator is implemented with TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1mW, which makes it a promising design for the embedded devices.
  •  
27.
  • Chen, Qinyu, et al. (författare)
  • Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks
  • 2022
  • Ingår i: 2022 Ieee International Conference On Artificial Intelligence Circuits And Systems (Aicas 2022). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 25-28
  • Konferensbidrag (refereegranskat)abstract
    • The study of specialized accelerators tailored for neural networks is becoming a promising topic in recent years. Such existing neural network accelerators are usually designed for convolutional neural networks (CNNs) or recurrent neural networks have been (RNNs), however, less attention has been paid to the attention mechanisms, which is an emerging neural network primitive with the ability to identify the relations within input entities. The self-attention-oriented models such as Transformer have achieved great performance on natural language processing, computer vision and machine translation. However, the self-attention mechanism has intrinsically expensive computational workloads, which increase quadratically with the number of input entities. Therefore, in this work, we propose an software-hardware co-design solution for energy-efficient self-attention inference. A prediction-based approximate self-attention mechanism is introduced to substantially reduce the runtime as well as power consumption, and then a specialized hardware architecture is designed to further increase the speedup. The design is implemented on a Xilinx XC7Z035 FPGA, and the results show that the energy efficiency is improved by 5.7x with less than 1% accuracy loss.
  •  
28.
  • Chen, Qinyu, et al. (författare)
  • Smilodon : An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning
  • 2019
  • Ingår i: 2019 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). - : IEEE. - 9781728103976
  • Konferensbidrag (refereegranskat)abstract
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image and video recognition, recommender systems, and natural language processing. However, the massive size and intensive computation loads prevent its feasible deployment in practice, especially on the embedded systems. As a highly competitive candidate, low bit-width CNNs are proposed to enable efficient implementation. In this paper, we propose Smilodon, a scalable, efficient accelerator for low bit-width CNNs based on a parallel streaming architecture, optimized with a task partitioning strategy. We also present the 3D systolic-like computing arrays fitting for convolutional layers. Our design is implemented on Zynq XC7ZO20 FPGA, which can satisfy the needs of real-time with a frame rate of 1, 622 FPS throughput, while consuming 2.1 Watt. To the best of our knowledge, our accelerator is superior to the state-of-the-art works in the tradeoff among throughput, power efficiency, and area efficiency.
  •  
29.
  • Chen, S., et al. (författare)
  • Hardware acceleration of multilayer perceptron based on inter-layer optimization
  • 2019
  • Ingår i: Proceedings - 2019 IEEE International Conference on Computer Design, ICCD 2019. - : Institute of Electrical and Electronics Engineers Inc.. - 9781538666487 ; , s. 164-172
  • Konferensbidrag (refereegranskat)abstract
    • Multilayer Perceptron (MLP) is used in a broad range of applications. Hardware acceleration of MLP is one most promising way to provide better performance-energy efficiency. Previous works focused on the intra-layer optimization and layer-after-layer processing, while leaving the inter-layer optimization never studied. In this paper, we propose hardware acceleration of MLPs based on inter-layer optimization which allows us to overlap the execution of MLP layers. First we describe the inter-layer optimization from software and mathematical perspectives. Then, a reference Two-Neuron architecture which is efficient to support the inter-layer optimization is proposed and implemented. Discussions about area cost, performance and energy consumption are carried out to explore the scalability of the Two-Neuron architecture. Results show that the proposed MLP design optimized across layers achieves better performance and energy efficiency than the conventional intra-layer optimized designs. As such, the inter-layer optimization provides another possible direction other than the intra-layer optimization to gain further performance and energy improvements for the hardware acceleration of MLPs.
  •  
30.
  • Chen, Xiaowen, et al. (författare)
  • A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
  • 2018
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 26:10, s. 1953-1966
  • Tidskriftsartikel (refereegranskat)abstract
    • Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 220 points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with the area of 2.4 mm(2) and the power consumption of 91.3 mW at 25 degrees C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89x speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
  •  
31.
  • Chen, Xiaowen, et al. (författare)
  • Area and Performance Optimization of Barrier Synchronization on Multi-core Network-on-Chips
  • 2010
  • Ingår i: 3rd IEEE International Conference on Computer and Electrical Engineering (ICCEE).
  • Konferensbidrag (refereegranskat)abstract
    • Barrier synchronization is commonly and widelyused to synchronize the execution of parallel processor coreson multi-core Network-on-Chips (NoCs). Since its globalnature may cause heavy serialization resulting in largeperformance penalty, barrier synchronization should becarefully designed to have low latency communication and tominimize overall completion time. Therefore, in the paper, wepropose a fast barrier synchronization mechanism, targetingMulti-core NoCs. The fast barrier synchronization mechanismincludes a dedicated hardware module, named Fast BarrierSynchronizer (FBS), integrated with each processor node. Itoffers a set of barrier counters and can concurrently processsynchronization requests issued by the local node and remotenodes via the on-chip network. The salient feature of our fastbarrier synchronization mechanism is that, once the barriercondition is reached, the “barrier release” acknowledgement isrouted to all processor nodes in a broadcast way in order tosave chip area by avoiding storing source node informationand to minimize completion time by avoiding serialization ofbarrier releasing. Synthesis results suggest that the FBS canrun over 1 GHz in SMIC® 130nm technology with small areaoverhead. We implemented a FBS-enhanced multi-core NoCarchitecture on our FPGA platform using the Xilinx® Virtex 5as the FPGA chip. FPGA utilization and simulation resultsshow that our fast barrier synchronization demonstrates botharea and performance advantages over the barriersynchronization counterpart with unicast barrier releasing.
  •  
32.
  • Chen, Xiaowen, et al. (författare)
  • Cooperative communication based barrier synchronization in on-chip mesh architectures
  • 2011
  • Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 8:22, s. 1856-1862
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose cooperative communication as a means to enable efficient and scalable barrier synchronization on mesh-based many-core architectures. Our approach is different from but orthogonal to conventional algorithm-based optimizations. It relies on collaborating routers to provide efficient gather and multicast communication. In conjunction with a master-slave algorithm, it exploits the mesh regularity to achieve efficiency. The gather and multicast functions have been implemented in our router. Synthesis results suggest marginal area overhead. With synthetic and benchmark experiments, we show that our approach significantly reduces synchronization completion time and increases speedup.
  •  
33.
  • Chen, Xiaowen, et al. (författare)
  • Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs
  • 2014
  • Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 11:18, s. 20140542-
  • Tidskriftsartikel (refereegranskat)abstract
    • On many-core Network-on-Chips (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. Different from conventional algorithm-based approaches, the paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With the cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. The cooperative communication is implemented in our router at low cost. Through comparative experiments, our approach evidently exhibits high efficiency and good scalability.
  •  
34.
  • Chen, Xiaowen, 1982- (författare)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • In NoC-based many-core processors, memory subsystem and synchronization mechanism are always the two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also efficient synchronization mechanism. Therefore, we are motivated to research on efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.One major way of optimizing the memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable to NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy programming. Therefore, we first adopt the microcoded approach to address DSM issues, aiming for hardware performance but maintaining the flexibility of programs. Second, we further optimize the DSM performance by reducing the virtual-to-physical address translation overhead. In addition to the general-purpose memory organization such as DSM, there exists special-purpose memory organization to optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation.In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies of some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors through narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency.Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause large performance penalty. Motivated by this, different from the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. Besides, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
  •  
35.
  • Chen, Xiaowen, et al. (författare)
  • Handling Shared Variable Synchronization in Multi-core Network-on-Chips with Distributed Memory
  • 2010
  • Ingår i: Proceedings. - 9781424466832 ; , s. 467-472
  • Konferensbidrag (refereegranskat)abstract
    • Parallelized shared variable applications running on multi-core Network-on-Chips(NoCs) require efficient support for synchronization, since communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. In this paper, we propose a dedicated hardware module forsynchronization management. This module is called Synchronization Handler (SH), integrated with each processor-memory node on the multi-core NoCs. It uses two physical buffers to concurrently process synchronization requests issued by the local processor and remote processors via the on-chip network. One salient feature is that the two physical buffers are dynamically allocated to form multiple virtual buffers (a virtual buffer is related to a shared synchronization variable) so as to improve the buffer utilization and alleviate the head-of-line blocking. Synthesis results suggest that the SH can run over 900 MHz in 130nm technology with small area overhead. To justify the SH-enhanced multicore NoCs, we employ synthetic workloads to evaluate synchronizationcost and buffer utilization, and run synchronization-intensive applications to investigate speedup. The results show that our approach is viable.
  •  
36.
  • Chen, Xiaowen, et al. (författare)
  • Hybrid distributed shared memory space in multi-core processors
  • 2011
  • Ingår i: Journal of Software. - : International Academy Publishing (IAP). - 1796-217X. ; 6:12 SPEC. ISSUE, s. 2369-2378
  • Tidskriftsartikel (refereegranskat)abstract
    • On multi-core processors, memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtualto- Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
  •  
37.
  • Chen, Xiaowen, et al. (författare)
  • Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method
  • 2016
  • Ingår i: 2016 TENTH IEEE/ACM INTERNATIONAL SYMPOSIUM ON NETWORKS-ON-CHIP (NOCS). - : IEEE. - 9781467390309
  • Konferensbidrag (refereegranskat)abstract
    • In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient fault. Conventional ECC techniques brings heavy area, power, and timing overheads when correcting and detecting multiple transient faults. Therefore, a cost-effective ECC technique, named 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as its matrix appearance and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal matrix rows and vertical matrix columns). Horizontal PCCs and vertical PCCs work together to find the faults' position and then correct them by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its correction and detection capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce the ECC hardware cost, have much higher fault detection coverage, maintain almost zero silent fault percentages, and have higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable to the multi-bit transient fault control for NoC links.
  •  
38.
  • Chen, Xiaowen, et al. (författare)
  • Multi-FPGA Implementation of a Network-on-Chip Based Many-core Architecture with Fast Barrier Synchronization Mechanism
  • 2010
  • Ingår i: Proceedings of the IEEE Norchip Conference. - 9781424489732
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we propose a fast barrier synchronization mechanism, targetingNetwork-on-Chip based manycore architectures. Its salient feature is that, once thebarrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save area by avoiding storing source node information and to minimize completion time by eliminating serialization of barrierreleasing. Then, we construct a multi-FPGA platform using Xilinx® Virtex 5 as FPGA chipsand implement a NoC based many-core architecture on it. FPGA utilization and simulation results show that our mechanism demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing. 
  •  
39.
  • Chen, Xiaowen, et al. (författare)
  • Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications
  • 2015
  • Ingår i: Journal of Electrical and Computer Engineering. - : Hindawi Limited. - 2090-0147 .- 2090-0155. ; 2015
  • Tidskriftsartikel (refereegranskat)abstract
    • On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Homogenous OLPCs feature strong regularity and scalability due to its identical cores and routers. Data-parallel applications have their parallel data subsets that are handled individually by the same program running in different cores. Therefore, data-parallel applications are able to obtain good speedup in homogenous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. When establishing the speedup performance model, the network communication latency and the ways of storing data of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the performance model's analysis. Finally, three data-parallel applications are performed on our cycle-accurate homogenous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications onto homogenous OLPCs.
  •  
40.
  • Chen, Xiaowen, et al. (författare)
  • Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property
  • 2013
  • Ingår i: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 39:2, s. 596-612
  • Tidskriftsartikel (refereegranskat)abstract
    • In Network-on-Chip (NoC) based multi-core platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs performance penalty due to the Virtual-to-Physical (V2P) address translation overhead for all memory accesses. Based on the data property which can be either private or shared, this paper proposes a hybrid DSM which partitions a local memory into a private and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for those data with changeable property when they become private at runtime. In the experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show performance improvement up to 37.89%.
  •  
41.
  • Chen, Xiaowen, et al. (författare)
  • Round-trip DRAM access fairness in 3D NoC-based many-core systems
  • 2017
  • Ingår i: ACM Transactions on Embedded Computing Systems. - : Association for Computing Machinery. - 1539-9087 .- 1558-3465. ; 16:5s
  • Tidskriftsartikel (refereegranskat)abstract
    • In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances and the latency gap of different DRAM accesses becomes bigger as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses that become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency difference are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces so that the DRAM accesses with potential high latencies can be transferred as early and fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by reference [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD)1 and speedup. In the experiments, the maximum improvement of the maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3% respectively. Besides, our proposal brings very small extra hardware overhead (<0.6%) in comparison to the three counterparts.
  •  
42.
  • Chen, Xiaowen, 1982-, et al. (författare)
  • Run-time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips
  • 2010
  • Ingår i: The 3rd IEEE International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 39-46
  • Konferensbidrag (refereegranskat)abstract
    • On multi-core Network-on-Chips (NoCs), mem- ories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memoryaddresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. Thehybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressingon shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-timepartitioning of hybrid DSM organization in order to analyze its perfor- mance. A real DSM based multi-core NoC platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioningdemonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improve- ment depends on problem size, way of datapartitioning and computation/ communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
  •  
43.
  • Chen, Xiaowen, et al. (författare)
  • Speedup Analysis of Data-parallel Applications on Multi-core NoCs
  • 2009
  • Ingår i: Proceedings of the IEEE International Conference on ASIC (ASICON). - 9781424438686 ; , s. 105-108
  • Konferensbidrag (refereegranskat)abstract
    • As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on Multi-core Network-onChips (NoCs). For data-parallel applications, we study the model ofparallel speedup by including network communication latency in Amdahl's law. The speedup analysis considers the effect of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. In our Multi-core NoC platform, a real data-parallel application, i.e. matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and the speedup efficiency decreases as the system size is scaled up. Such analysis can be used to guide architects and programmers to improve parallel processing efficiency by reducing network latency with optimized network design and increasing computation proportion in the program.
  •  
44.
  • Chen, Xiaowen, et al. (författare)
  • Supporting Distributed Shared Memory on Multi-core Network-on-Chips Using a Dual Microcoded Controller
  • 2010
  • Ingår i: Proceedings of the conference for Design Automation and Test in Europe. ; , s. 39-44
  • Konferensbidrag (refereegranskat)abstract
    • Supporting Distributed Shared Memory (DSM) is essential for multi-coreNetwork-on-Chips for the sake of reusing huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable where the DSM functions such as virtual-to-physical address translation,memory access and synchronization etc. are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, ourcontroller features two mini-processors, one dealing with requests from the local coreand the other from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and can run up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to hardware solutions on average but still have all the flexibility of software solutions.
  •  
45.
  • Chen, Xiaowen, 1982-, et al. (författare)
  • Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique
  • 2010
  • Ingår i: Proceedings of the IEEE Annual Symposium on VLSI. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 462-463
  • Konferensbidrag (refereegranskat)abstract
    • This paper explores a dynamic buffer allocation technique to guide a distributedsynchronization architecture to support efficient synchronization on multi-core Network-on-Chips (NoCs). The synchronization architecture features two physical buffers to be able to concurrently queue and handle synchronization requests issued by the local processor and remote processors via the on-chip network. Using the dynamic bufferallocation technique, the two physical buffers are dynamically allocated to form multiple virtual buffers in order to improve buffers' utilization. Experiments are carried on to evaluate buffers' utilization.
  •  
46.
  • Chen, Y., et al. (författare)
  • A deadlock-free fault-tolerant routing algorithm based on pseudo-receiving mechanism for networks-on-chip of CMP
  • 2011
  • Ingår i: 2011 International Conference on Multimedia Technology, ICMT 2011. - 9781612847719 ; , s. 2825-2828
  • Konferensbidrag (refereegranskat)abstract
    • As the size of CMOS technology scales down to nanometers domain, fault-tolerant is becoming a challenge for NoC. Turn model provides a simple and efficient systematic approach to the development of deadlock-free routing algorithms. In this paper, we propose a pseudo-receiving mechanism based on the support of local processor's cache to enable prohibited turn, and meanwhile make it keep deadlockfree. We present a fault-tolerant routing algorithm based on pseudo-receiving mechanism for 2D mesh. The routing algorithm is livelock-free in the cost of disable a few un-faulty links or nodes. The algorithm is applied to a single-cycle fixed output-buffered router. Experimental results show that, it achieves high performance even under high faulty rate.
  •  
47.
  • Chen, Yancang, et al. (författare)
  • A single-cycle output buffered router with layered switching for Networks-on-Chips
  • 2012
  • Ingår i: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 38:4, s. 906-916
  • Tidskriftsartikel (refereegranskat)abstract
    • We present a single-cycle output buffered router based on layered switching for networks on chips (NoCs). Different from state-of-the-art NoC routers, the router has three important characteristics: (1) It employs layered switching, which implements wormhole on top of virtual cut-through (VCT) switching; (2) In contrast to input buffered architectures, it adopts an output buffered architecture; (3) It is single cycle, meaning that the router pipeline takes only one cycle for all flits. Experimental results show that the router achieves up to 80% of ideal network throughput under uniform random traffic pattern. Compared with wormhole switching, layered switching achieves up to 36.9% latency reduction for 12-flit packets under uniform random traffic with an injection rate of 0.5 flit/cycle/node. Under 65 nm technology synthesized results show that its critical path has only 20 logic gates, and it reduces 11% area compared to the input virtual-channel router with the same buffer capacity.
  •  
48.
  • Chen, Yancang, et al. (författare)
  • A Trace-driven Hardware-level Simulator for Design and Verification of Network-on-Chips
  • 2010
  • Ingår i: 2011 INTERNATIONAL CONFERENCE ON COMPUTERS, COMMUNICATIONS, CONTROL AND AUTOMATION (CCCA 2011), VOL II. - : IEEE. ; , s. 32-35
  • Konferensbidrag (refereegranskat)abstract
    • Traditional communications of general-purpose multi-core processor and application-specific System-on-Chip face challenges in terms of scalability and complexity. Network-on-Chip (NoC) has been the most promising solution for the communications of multi-core and many-core chips. In this paper, we present a trace-driven hardware-level simulator (noted HS) based on SystemVerilog for the design and verification of NoCs. Different from the state-of-the-art NoC simulators, the HS owns three important characteristics in addition to the capability of creating simulation and synthesizable NoC descriptions: 1) hardware-level simulation can be done, which means more implementation details of hardware than flit-level simulation; 2) router debugging and verification can be done at RTL by inserting assertions and coverage; 3) trace-based application simulations can be done besides synthetic workloads. A 4 X 4 2D mesh NoC with output virtual-channel routers verifies the capability of our HS.
  •  
49.
  • Chen, Yizhi, 1995-, et al. (författare)
  • Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation
  • 2022
  • Ingår i: Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 239-246
  • Konferensbidrag (refereegranskat)abstract
    • Non-negative matrix factorization (NMF) is an ef-fective method for dimensionality reduction and sparse decom-position. This method has been of great interest to the scien-tific community in applications including signal processing, data mining, compression, and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which is inadequate for embedded applications. To overcome this limitation, we implement the vector dot-product with hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption, and preserves accuracy. To demonstrate our approach, we employ a design exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on ARM CPU, this hardware implementation accelerates the overall computation to decompose matrix by 5.597 × and reduces energy consumption by 69.323×. Log approximation NMF combined with KNN(k-nearest neighbors) has only 2.38% decreasing accuracy compared with the result of KNN processing the matrix after floating-point NMF on MNIST. Further on, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves 3.718× acceleration and 8.345× energy reduction. Compared with the fixed-point approach, our approach has an accuracy degradation of 1.93% on MNIST and an accuracy amelioration of 28.2% on the FASHION MNIST data set without pre-knowledge of the data range. Thus, our approach has better compatibility with the input data range.
  •  
50.
  • Chen, Yizhi, 1995-, et al. (författare)
  • Online Image Sensor Fault Detection for Autonomous Vehicles
  • 2022
  • Ingår i: Proceedings. - : Institute of Electrical and Electronics Engineers Inc.. ; , s. 120-127
  • Konferensbidrag (refereegranskat)abstract
    • Automated driving vehicles have shown glorious potential in the near future market due to the high safety and convenience for drivers and passengers. Image sensors' reliability attract many researchers' interests as many image sensors are used in autonomous vehicles. We propose an online image sensor fault detection method based on comparing the historical variances of normal pixels and defective pixels to detect faults. For fault pixels without uncertainty, with a detecting window of more than 30 frames, we get 100% accuracy and 100% recall on realistic continuous traffic pictures from the KITTI data set. We also explore the influence of fault pixel values' uncertainty from 0% to 25% and study different fixed thresholds and a dynamic threshold for judgments. Strict threshold, which is 0.1, has a high accuracy (99.16%) but has a low recall (34.46%) for 15% uncertainty. Loose threshold, which is 0.3, has a relatively high recall (83.78%) but mistakes too many normal pixels with 18.17% accuracy for 15% uncertainty. Our dynamic threshold balances the accuracy and recall. It gets 100% accuracy and 58.69% recall for 5% uncertainty and 78.38% accuracy and 55.39% recall for 15% uncertainty. Based on the detected damage pixel rate, we develop a health score for evaluating the image sensor system intuitively. It can also be helpful for making decision about replacing cameras.
  •  
Skapa referenser, mejla, bekava och länka
  • Resultat 1-50 av 294

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy