SwePub

Result list for search "WFRF:(Lu Zhonghai) srt2:(2015-2019)"


  • Results 1-50 of 84
1.
  • Chen, X., et al. (author)
  • Achieving memory access equalization via round-trip routing latency prediction in 3D many-core NoCs
  • 2015
  • In: Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI. IEEE. pp. 398-403
  • Conference paper (peer reviewed). Abstract:
    • 3D many-core NoCs are emerging architectures for future high-performance single chips due to their integration of many processor cores and memories by stacking multiple layers. In such architectures, because processor cores and memories reside in different locations (center, corner, edge, etc.), memory accesses behave differently due to their different communication distances, and the performance (latency) gap between memory accesses widens as the network size is scaled up. Some memory accesses may therefore suffer very high latencies, degrading system performance. To achieve high performance, it is crucial to reduce the number of memory accesses with very high latencies. However, this must be done with care, since shortening the latency of one memory access can worsen the latency of another as a result of shared network resources. The goal should therefore be to narrow the latency difference between memory accesses. In this paper, we address this goal by prioritizing memory access packets based on predicted round-trip routing latencies. The communication distance and the number of occupied buffer slots along the remaining routing path are used to predict the round-trip latency of a memory access. The predicted round-trip routing latency is then used as the basis for arbitrating memory access packets, so that memory accesses with potentially high latency are transferred as early and as fast as possible, equalizing memory access latencies as much as possible. Experiments with varied network sizes and packet injection rates show that our approach achieves memory access equalization and outperforms classic round-robin arbitration in terms of maximum latency, average latency, and latency standard deviation (LSD). In the experiments, the maximum improvements of the maximum latency, the average latency, and the LSD are 80%, 14%, and 45%, respectively.
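The arbitration idea in the abstract above can be sketched in a few lines: grant the packet whose predicted round-trip latency (remaining distance plus downstream buffer occupancy) is highest. The `Packet` fields and the cost weights `HOP_COST` and `BUFFER_COST` are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

HOP_COST = 2      # assumed cycles per remaining hop (round trip)
BUFFER_COST = 1   # assumed cycles per occupied buffer slot on the path

@dataclass
class Packet:
    pid: int
    remaining_hops: int      # communication distance still to travel
    occupied_buffers: int    # occupied slots in buffers along the path

def predicted_round_trip_latency(p: Packet) -> int:
    """Predict round-trip latency from distance and buffer occupancy."""
    return HOP_COST * p.remaining_hops + BUFFER_COST * p.occupied_buffers

def arbitrate(candidates: list) -> Packet:
    """Grant the packet with the highest predicted latency, so potentially
    slow memory accesses are forwarded as early as possible."""
    return max(candidates, key=predicted_round_trip_latency)

pkts = [Packet(0, 3, 1), Packet(1, 8, 4), Packet(2, 2, 0)]
winner = arbitrate(pkts)   # packet 1: farthest and most congested
```

A round-robin arbiter would ignore the prediction entirely; the point of the scheme is that the `max` above systematically favors the accesses that would otherwise stretch the latency tail.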
2.
  • Liu, Weihua, et al. (author)
  • Characterizing the Reliability and Threshold Voltage Shifting of 3D Charge Trap NAND Flash
  • 2019
  • In: 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE. ISBN 9783981926323, pp. 312-315
  • Conference paper (peer reviewed). Abstract:
    • 3D charge trap (CT) triple-level cell (TLC) NAND flash is gradually becoming a mainstream storage component due to its high storage capacity and performance, but it raises reliability concerns. Fault tolerance and data management schemes can improve reliability; designing a more efficient solution, however, requires understanding the reliability characteristics of 3D CT TLC NAND flash. To facilitate such understanding, we use a real-world testing platform to investigate these characteristics, including the raw bit error rate (RBER) and the threshold voltage (Vth) shifting features after exposure to variable disturbances. We analyze why these characteristics arise in 3D CT TLC NAND flash, and we hope these observations can guide designers toward highly efficient solutions to the reliability problem.
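As a rough illustration of the RBER measurement such a characterization platform performs: write a known pattern, read it back after a disturbance, and count flipped bits over total bits. The page contents below are made-up example data, not measurements from the paper.

```python
def raw_bit_error_rate(written: bytes, read_back: bytes) -> float:
    """RBER = number of mismatched bits / total number of bits."""
    assert len(written) == len(read_back)
    bit_errors = sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))
    return bit_errors / (8 * len(written))

page_written = bytes([0xAA] * 4)                 # known test pattern
page_read = bytes([0xAA, 0xAB, 0xAA, 0x2A])      # two bits flipped by disturbance
rber = raw_bit_error_rate(page_written, page_read)   # 2 errors / 32 bits
```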
3.
  • Badawi, Mohammad, 1981- (author)
  • Adaptive Coarse-grain Reconfigurable Protocol Processing Architecture
  • 2016
  • Doctoral thesis (other academic/artistic). Abstract:
    • Digital signal processors and their variants have provided significant benefits for efficient implementation of the Physical Layer (PHY) of the Open Systems Interconnection (OSI) model's seven-layer protocol processing stack, compared to general-purpose processors. Protocol processors promise to provide a similar advantage for implementing the higher layers of the OSI model. This thesis addresses the problem of designing customizable coarse-grain reconfigurable protocol processing fabrics as a solution for achieving high performance and computational efficiency. A key requirement addressed by this thesis is the ability to adapt not only to varying applications and standards, and to different modes within each standard, but also to time-varying load and performance demands, while maintaining quality of service. This thesis presents a tile-based multicore protocol processing architecture that can be customized at design time to meet the requirements of the target application. The architecture can then be reconfigured at boot time and tuned to suit the desired use case. It includes a packet-oriented memory system with deterministic access time and access energy costs, which can hence be accurately dimensioned to fulfill the requirements of the desired use case. Moreover, to maintain the predicted quality of service while minimizing the use of energy and resources, the architecture encompasses an elastic management scheme that controls run-time configuration to deploy processing resources based on use case and traffic demands. To evaluate the architecture presented in this thesis, different case studies were conducted, using quantitative and qualitative metrics for assessment. Energy-delay product, energy efficiency, area efficiency, and throughput show the improvements achieved by the processing cores and the memory of the presented architecture, compared with other solutions. Furthermore, the results show the reduction in latency and power consumption required to evaluate controlling states when using the elastic management scheme. The elasticity of the scheme also reduces the total area required for controllers serving multiple processing cores in comparison with other designs. Finally, the results validate the ability of the presented architecture to support quality of service without wasting available energy, in a real-life case study of a multi-participant Voice over Internet Protocol (VoIP) call.
4.
  • Badawi, Mohammad, et al. (author)
  • Elastic Management and QoS Provisioning Scheme for Adaptable Multi-core Protocol Processing Architecture
  • 2016
  • In: 19th Euromicro Conference on Digital System Design (DSD 2016). IEEE. ISBN 9781509028160, pp. 575-583
  • Conference paper (peer reviewed). Abstract:
    • Adaptable protocol processing architectures can offer quality of service (QoS) while improving energy efficiency and resource utilization. However, a key condition for adaptable architectures to support QoS is that the latency required for processor adaptation does not violate the packet processing delay bound. Moreover, adaptation latency must not cause packets to accumulate until memory becomes full and packets are dropped. In this paper, we present an elastic management scheme for an agile, adaptable multi-core protocol processing architecture to facilitate processor adaptation when QoS must be maintained. The proposed management scheme encompasses a set of reconfigurable finite state machines (FSMs), each dimensioned to be associated with a single processing element (PE). During processor adaptation, the needed FSMs can be rapidly clustered to provide the control needed for the newly adapted structure. We use a real-life application to demonstrate how our proposed management scheme supports maintaining QoS during processor adaptation. We also quantify the time needed for processor adaptation, as well as the reductions in energy, latency, and area achieved when using our scheme.
5.
  • Badawi, Mohammad, et al. (author)
  • Quality-of-service-aware adaptation scheme for multi-core protocol processing architecture
  • 2017
  • In: Microprocessors and Microsystems. Elsevier. ISSN 0141-9331, 1872-9436. 54, pp. 47-59
  • Journal article (peer reviewed). Abstract:
    • Employing adaptable protocol processing architectures has shown high potential for provisioning quality of service (QoS) while retaining efficient use of the available energy budget. Nevertheless, successful QoS provisioning using adaptable protocol processing architectures requires adaptation to be agile and low-latency: a long adaptation latency can violate the desired packet processing latency or throughput, or cause packet loss if the memory cannot accommodate packet accumulation. This paper presents an elastic management scheme that permits agile, QoS-aware adaptation of processing elements (PEs) within the protocol processing architecture, such that the desired QoS is maintained. Moreover, our proposed scheme has the potential to reduce energy consumption, since it employs PEs on demand. We quantify the latency required for PE adaptation and the reductions in energy and area that can be achieved using our scheme. We also consider two different real-life use cases to demonstrate the effectiveness of our proposed management scheme in maintaining QoS while conserving available energy.
6.
  • Badawi, Mohammad, et al. (author)
  • Service-Guaranteed Multi-Port Packet Memory for Parallel Protocol Processing Architecture
  • 2016
  • In: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016. IEEE. ISBN 9781467387750, pp. 408-412
  • Conference paper (peer reviewed). Abstract:
    • Parallel processing architectures have been increasingly utilized due to their potential for improving performance and energy efficiency. Unfortunately, the anticipated improvement often suffers from a limitation caused by memory access latency and latency variation, which in turn impact quality of service (QoS). This paper presents a service-guaranteed multi-port packet memory system to boost parallelism in protocol processing architectures. In the proposed memory system, every arriving packet is guaranteed a memory space: a packet's memory space can be allocated in a bounded number of cycles, and each of its locations is accessible in a single cycle. We consider a real-time Voice over Internet Protocol (VoIP) call as a case study to evaluate our service-guaranteed memory system.
7.
  • Becker, Matthias, 1986-, et al. (author)
  • An adaptive resource provisioning scheme for industrial SDN networks
  • 2019
  • In: IEEE International Conference on Industrial Informatics (INDIN). IEEE. ISBN 9781728129273, pp. 877-880
  • Conference paper (peer reviewed). Abstract:
    • Many industrial domains face the challenge of ever-growing networks, driven for example by the Internet of Things and Industry 4.0. This typically comes with increased network configuration and management effort. In addition to increasing network size, these domains are typically subject to adaptive load situations that pose an additional challenge to the network infrastructure. Software-defined networking (SDN) is a promising networking paradigm that reduces configuration complexity and management effort in Ethernet networks. In this work, we investigate SDN in the context of adaptive scenarios with QoS constraints. Our approach monitors several thresholds which automatically trigger redistribution of resources via the central SDN controller. This setup leads to an agile system that can dynamically react to load changes without overprovisioning the infrastructure. The approach is implemented in a low-level simulation environment, where we demonstrate its benefits using a case study.
8.
  • Becker, Matthias, 1986-, et al. (author)
  • Towards QoS-Aware Service-Oriented Communication in E/E Automotive Architectures
  • 2018
  • In: Proceedings of the 44th Annual Conference of the IEEE Industrial Electronics Society (IECON). IEEE. pp. 4096-4101
  • Conference paper (peer reviewed). Abstract:
    • With the rise of increasingly advanced driving assistance systems in modern cars, execution platforms built on the principle of service-oriented architectures are being proposed. Alongside them, service-oriented communication is used to provide the required adaptive communication infrastructure on top of automotive Ethernet networks. A middleware is proposed that enables QoS-aware service-oriented communication between software components, where the prescribed behavior of each software component is defined by an Assume/Guarantee (A-G) contract. To enable the use of COTS components, which are often not sufficiently verified for use in automotive systems, the middleware monitors the communication behavior of components and verifies it against each component's A-G contract. A violation of the allowed communication behavior then triggers adaptation processes in the system, while the impact on other communication is minimized. The applicability of the approach is demonstrated by a case study that utilizes a prototype implementation of the proposed approach.
9.
  • Chen, Dejiu, et al. (author)
  • A methodological framework for model-based self-management of services and components in dependable cyber-physical systems
  • 2017
  • In: 12th International Conference on Dependability and Complex Systems, DepCoS-RELCOMEX 2017. Cham: Springer. ISBN 9783319594149, pp. 97-105
  • Conference paper (peer reviewed). Abstract:
    • Modern automotive vehicles featuring ADAS (Advanced Driving Assistance Systems) and AD (Autonomous Driving) represent one category of dependable CPS (Cyber-Physical Systems). For such systems, the adoption of general-purpose COTS (Commercial Off-The-Shelf) services and components has been advocated in industry as a necessary means of shortening innovation loops and enabling efficient product evolution. This is, however, not a trivial task, due to the systems' safety- and time-criticality. It calls on the one hand for formal specification of systems, and on the other for a systematic approach to module design, supervision, and adaptation. Accordingly, we propose in this paper a novel method that emphasizes an integration of system models, formal contracts, and embedded services for effective self-management of COTS. The key modeling technologies include EAST-ADL for formal system description and A-G contract theory for module specification.
10.
  • Chen, DeJiu, et al. (author)
  • IMBSA 2017: Model-Based Safety and Assessment
  • 2017
  • In: Model-Based Safety and Assessment - 5th International Symposium, Trento, Italy, September 11-13, 2017. Cham: Springer. ISBN 9783319641188, pp. 227-240
  • Conference paper (peer reviewed). Abstract:
    • Modern automotive vehicles represent one category of CPS (Cyber-Physical Systems) that are inherently time- and safety-critical. To justify actions for quality-of-service adaptation and safety assurance, it is fundamental to perceive the uncertainties of system components in operation, caused by emergent properties and design or operation anomalies. From an industrial point of view, a further challenge relates to the use of general-purpose COTS (Commercial Off-The-Shelf) components, which are separately developed and evolved, and often not sufficiently verified and validated for specific automotive contexts. While introducing additional uncertainties regarding overall system performance and safety, the adoption of COTS components constitutes a necessary means for effective product evolution and innovation. Accordingly, we propose in this paper a novel approach that aims to enable advanced operation monitoring and self-assessment of operational uncertainties, and thereby automated performance and safety awareness. The emphasis is on the integration of several modeling technologies: the domain-specific modeling framework EAST-ADL, A-G contract theory, and Hidden Markov Models (HMM). In particular, we also present some initial concepts for using performance and safety awareness for quality-of-service adaptation and dynamic risk mitigation.
11.
  • Chen, Qinyu, et al. (author)
  • An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
  • 2019
  • In: Electronics. MDPI. ISSN 2079-9292. 8:4
  • Journal article (peer reviewed). Abstract:
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition and speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse-grain task partitioning (CGTP) strategy, the proposed accelerator, with heterogeneous computing units supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, reducing power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture. The accelerator is implemented in TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
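As a minimal sketch of the activation quantification step that such low bit-width designs implement in hardware: a uniform unsigned 2-bit scheme with ReLU-style clipping. The bit width and clipping range are assumptions for illustration, not the paper's scheme.

```python
def quantize(x: float, bits: int = 2, x_max: float = 1.0) -> int:
    """Uniformly quantize x in [0, x_max] to an unsigned 'bits'-bit code."""
    levels = (1 << bits) - 1             # number of quantization steps
    x = min(max(x, 0.0), x_max)          # ReLU-style clipping to the range
    return round(x / x_max * levels)     # nearest code word

codes = [quantize(v) for v in (0.0, 0.3, 0.7, 1.2)]   # -> [0, 1, 2, 3]
```

In hardware this reduces each activation to a 2-bit code, which is what makes the subsequent multiply-accumulate logic so cheap.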
12.
  • Chen, Qinyu, et al. (author)
  • Smilodon: An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning
  • 2019
  • In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE. ISBN 9781728103976
  • Conference paper (peer reviewed). Abstract:
    • Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image and video recognition, recommender systems, and natural language processing. However, their massive size and intensive computation loads prevent feasible deployment in practice, especially on embedded systems. As a highly competitive candidate, low bit-width CNNs have been proposed to enable efficient implementation. In this paper, we propose Smilodon, a scalable, efficient accelerator for low bit-width CNNs based on a parallel streaming architecture, optimized with a task partitioning strategy. We also present 3D systolic-like computing arrays fitted to convolutional layers. Our design is implemented on a Zynq XC7Z020 FPGA, satisfying real-time needs with a throughput of 1,622 FPS while consuming 2.1 W. To the best of our knowledge, our accelerator is superior to state-of-the-art works in the tradeoff among throughput, power efficiency, and area efficiency.
13.
  • Chen, S., et al. (author)
  • Hardware acceleration of multilayer perceptron based on inter-layer optimization
  • 2019
  • In: Proceedings - 2019 IEEE International Conference on Computer Design, ICCD 2019. IEEE. ISBN 9781538666487, pp. 164-172
  • Conference paper (peer reviewed). Abstract:
    • Multilayer Perceptrons (MLPs) are used in a broad range of applications. Hardware acceleration of MLPs is one of the most promising ways to provide better performance-energy efficiency. Previous works focused on intra-layer optimization and layer-after-layer processing, leaving inter-layer optimization unstudied. In this paper, we propose hardware acceleration of MLPs based on inter-layer optimization, which allows us to overlap the execution of MLP layers. First, we describe the inter-layer optimization from software and mathematical perspectives. Then, a reference Two-Neuron architecture, efficient in supporting the inter-layer optimization, is proposed and implemented. Discussions of area cost, performance, and energy consumption explore the scalability of the Two-Neuron architecture. Results show that the proposed MLP design optimized across layers achieves better performance and energy efficiency than conventional intra-layer-optimized designs. As such, inter-layer optimization provides another possible direction, besides intra-layer optimization, for further performance and energy improvements in hardware acceleration of MLPs.
14.
  • Chen, Xiaowen, et al. (author)
  • A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
  • 2018
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. IEEE. ISSN 1063-8210, 1557-9999. 26:10, pp. 1953-1966
  • Journal article (peer reviewed). Abstract:
    • Fast Fourier transform (FFT) is the kernel and most time-consuming algorithm in the domain of digital signal processing, and FFT sizes differ greatly between applications. Therefore, this paper proposes a variable-size FFT hardware accelerator that fully supports the IEEE 754 single-precision floating-point standard and FFT calculation over a wide size range, from 2 to 2^20 points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which efficiently divides a large FFT into several small FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with an area of 2.4 mm² and power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves speedups of up to 18.89x in comparison with two software-only solutions and two hardware-dedicated solutions.
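The transposition-based Cooley-Tukey decomposition the abstract describes can be sketched as follows, using a naive DFT for the small sub-transforms. This is the textbook four-step algorithm (row transforms, twiddle factors, column transforms, transposed gather), not the paper's hardware design.

```python
import cmath

def dft(x):
    """Naive DFT, standing in for the small sub-FFTs."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft_four_step(x, n1, n2):
    """Length n1*n2 DFT of x via the matrix-transposition decomposition."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 transforms of length n2 over rows A[r][c] = x[r + n1*c]
    rows = [dft([x[r + n1 * c] for c in range(n2)]) for r in range(n1)]
    # Step 2: multiply by the twiddle factors W_N^(r*k2)
    for r in range(n1):
        for k2 in range(n2):
            rows[r][k2] *= cmath.exp(-2j * cmath.pi * r * k2 / n)
    # Step 3: n2 transforms of length n1 over the columns
    cols = [dft([rows[r][k2] for r in range(n1)]) for k2 in range(n2)]
    # Step 4: gather with a transposed index map: X[k2 + n2*k1] = cols[k2][k1]
    return [cols[k % n2][k // n2] for k in range(n)]

# Agrees with a direct DFT of the same input:
signal = [float(v) for v in range(6)]
result = fft_four_step(signal, 2, 3)
```

The appeal for hardware is that steps 1 and 3 are many small, independent transforms that can run in parallel, with the transposition handled by the memory system.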
15.
  • Chen, Xiaowen, 1982- (author)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doctoral thesis (other academic/artistic). Abstract:
    • In NoC-based many-core processors, the memory subsystem and the synchronization mechanism are two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also an efficient synchronization mechanism. We are therefore motivated to research efficient memory access and synchronization under three topics: efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization. One major way of optimizing memory performance is constructing a suitable and efficient memory organization. A distributed memory organization suits NoC-based many-core processors well, since it features good scalability. We consider it essential to support Distributed Shared Memory (DSM), because of the huge amount of legacy code and for ease of programming. Therefore, we first adopt a microcoded approach to address DSM issues, aiming for hardware performance while maintaining the flexibility of programs. Second, we further optimize DSM performance by reducing the virtual-to-physical address translation overhead. In addition to general-purpose memory organizations such as DSM, special-purpose memory organizations exist to optimize application-specific memory access. We choose the Fast Fourier Transform (FFT) as the target application and propose a multi-bank data memory specialized for FFT computation. In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. As the network size increases, the difference in communication distance between memory accesses grows, resulting in unfair memory access performance among processor cores. This unfairness may lead to high latencies for some memory accesses, negatively affecting overall system performance. We are therefore motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors, narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency. Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm-oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause a large performance penalty. Motivated by this, and in contrast to the algorithm-based approaches, we take another direction (exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. In addition, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
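The master-slave baseline mentioned above can be sketched in software: each slave notifies a master when it reaches the barrier, and the master releases all slaves once every arrival is in. Threads and queues here are illustrative stand-ins for the on-chip synchronization messages, not the thesis's hardware mechanism.

```python
import threading, queue

def master_slave_barrier(n_threads, rounds, log):
    """Run 'rounds' barrier episodes across n_threads worker threads."""
    arrive = queue.Queue()                     # slaves -> master notifications
    release = [threading.Event() for _ in range(n_threads)]

    def worker(tid):
        for r in range(rounds):
            arrive.put(tid)                    # signal arrival at the barrier
            release[tid].wait()                # block until master releases
            release[tid].clear()
            log.append((r, tid))               # work "after" the barrier

    def master():
        for _ in range(rounds):
            for _ in range(n_threads):         # gather all n arrivals
                arrive.get()
            for ev in release:                 # broadcast the release
                ev.set()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    threads.append(threading.Thread(target=master))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

log = []
master_slave_barrier(4, 3, log)
```

The gather loop is exactly where contention concentrates on a real NoC: all n arrival messages converge on the master node, which is what the cooperative-communication approach targets.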
16.
  • Chen, Xiaowen, et al. (author)
  • Multi-bit Transient Fault Control for NoC Links Using 2D Fault Coding Method
  • 2016
  • In: 2016 Tenth IEEE/ACM International Symposium on Networks-on-Chip (NOCS). IEEE. ISBN 9781467390309
  • Conference paper (peer reviewed). Abstract:
    • In deep nanometer scale, Network-on-Chip (NoC) links are more prone to multi-bit transient faults. Conventional ECC techniques bring heavy area, power, and timing overheads when detecting and correcting multiple transient faults. Therefore, a cost-effective ECC technique, named the 2D fault coding method, is adopted to overcome the multi-bit transient fault issue of NoC links. Its key innovation is that the wires of a link are treated as a matrix, and light-weight Parity Check Coding (PCC) is performed on the matrix's two dimensions (horizontal rows and vertical columns). Horizontal PCCs and vertical PCCs work together to find the faults' positions and then correct the faults by simply inverting them. The procedure of using the 2D fault coding method to protect a NoC link is proposed, its detection and correction capability is analyzed, and its hardware implementation is carried out. Comparative experiments show that the proposal can largely reduce ECC hardware cost, achieves much higher fault detection coverage, maintains almost-zero silent fault percentages, and achieves higher fault correction percentages normalized under the same area, demonstrating that it is cost-effective and suitable for multi-bit transient fault control of NoC links.
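The row/column parity idea reads directly as code: compute even parity per matrix row and per column, then repair a flipped bit at the intersection of the failing row check and failing column check. This sketch handles the single-fault case (the paper analyzes multi-bit patterns); the matrix size and contents are made-up example data.

```python
def parity(bits):
    """Even parity of a sequence of 0/1 values."""
    return sum(bits) % 2

def encode(matrix):
    """Per-row and per-column parity check codes (PCCs) for the bit matrix."""
    row_pcc = [parity(row) for row in matrix]
    col_pcc = [parity(col) for col in zip(*matrix)]
    return row_pcc, col_pcc

def correct(matrix, row_pcc, col_pcc):
    """Locate bits where both a row check and a column check fail,
    and repair them by inversion."""
    bad_rows = [r for r, row in enumerate(matrix) if parity(row) != row_pcc[r]]
    bad_cols = [c for c, col in enumerate(zip(*matrix)) if parity(col) != col_pcc[c]]
    for r in bad_rows:
        for c in bad_cols:
            matrix[r][c] ^= 1
    return matrix

sent = [[1, 0, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 0]]                     # link wires viewed as a bit matrix
row_pcc, col_pcc = encode(sent)
received = [row[:] for row in sent]
received[1][2] ^= 1                       # inject a single transient fault
repaired = correct(received, row_pcc, col_pcc)
```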
17.
  • Chen, Xiaowen, et al. (author)
  • Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications
  • 2015
  • In: Journal of Electrical and Computer Engineering. Hindawi Limited. ISSN 2090-0147, 2090-0155. 2015
  • Journal article (peer reviewed). Abstract:
    • On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, referred to in this paper as On-chip Large-scale Parallel Computing Architectures (OLPCs). Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have parallel data subsets that are handled individually by the same program running on different cores, and they are therefore able to obtain good speedup on homogeneous OLPCs. This paper addresses modeling the speedup of homogeneous OLPCs for data-parallel applications. In establishing the speedup model, the network communication latency and the ways data-parallel applications store data are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. Uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the performance model's analysis. Finally, three data-parallel applications are run on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.
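A hedged sketch of the kind of speedup model the paper constructs: compute time divided across cores, plus an equivalent-serial-communication term that grows with core count. The specific cost functions below are assumptions for illustration, not the paper's formulas.

```python
def speedup(n_cores, t_compute, t_comm_per_core):
    """Estimated speedup when per-core serialized communication grows
    linearly with the number of cores."""
    t_parallel = t_compute / n_cores + t_comm_per_core * n_cores
    return t_compute / t_parallel

# With these assumed costs, speedup peaks and then degrades as the
# communication term starts to dominate the shrinking compute share.
curve = [speedup(n, 1000.0, 0.5) for n in (1, 16, 64, 256)]
```

This captures the qualitative conclusion such models support: beyond some network size, adding cores hurts a data-parallel workload unless communication cost is also reduced.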
18.
  • Chen, Xiaowen, et al. (author)
  • Round-trip DRAM access fairness in 3D NoC-based many-core systems
  • 2017
  • In: ACM Transactions on Embedded Computing Systems. ACM. ISSN 1539-9087, 1558-3465. 16:5s
  • Journal article (peer reviewed). Abstract:
    • In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances, and the latency gap between different DRAM accesses grows as the network size increases, leading to unfair DRAM access performance among different nodes. This phenomenon may cause high latencies for some DRAM accesses, which become the performance bottleneck of the system. This paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled, and the factors causing latency differences are discussed in detail. Secondly, DRAM access fairness is quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip time as the basis for prioritizing DRAM accesses in DRAM interfaces, so that DRAM accesses with potentially high latencies are transferred as early and as fast as possible, achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach achieves fair DRAM access and outperforms the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies of references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD), and speedup. In the experiments, the maximum improvements in maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3%, respectively. Besides, our proposal adds very small extra hardware overhead (<0.6%) in comparison to the three counterparts.
19.
  • Chen, Zhe, et al. (författare)
  • Toward FPGA Security in IoT : A New Detection Technique for Hardware Trojans
  • 2019
  • Ingår i: IEEE Internet of Things Journal. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 2327-4662. ; 6:4, s. 7061-7068
  • Tidskriftsartikel (refereegranskat)abstract
    • Nowadays, field programmable gate array (FPGA) has been widely used in Internet of Things (IoT) since it can provide flexible and scalable solutions to various IoT requirements. Meanwhile, hardware Trojan (HT), which may lead to undesired chip function or leak sensitive information, has become a great challenge for FPGA security. Therefore, distinguishing the Trojan-infected FPGAs is quite crucial for reinforcing the security of IoT. To achieve this goal, we propose a clock-tree-concerned technique to detect the HTs on FPGA. First, we present an experimental framework which helps us to collect the electromagnetic (EM) radiation emitted by FPGA clock tree. Then, we propose a Trojan identifying approach which extracts the mathematical feature of obtained EM traces, i.e., 2-D principal component analysis (2DPCA) in this paper, and automatically isolates the Trojan-infected FPGAs from the Trojan-free ones by using a BP neural network. Finally, we perform extensive experiments to evaluate the effectiveness of our method. The results reveal that our approach is valid in detecting HTs on FPGA. Specifically, for the trust-hub benchmarks, we can find out the FPGA with always on Trojans (100% detection rate) while identifying the triggered Trojans with high probability (by up to 92%). In addition, we give a thorough discussion on how the experimental setup, such as probe step size, scanning area, and chip ambient temperature, affects the Trojan detection rate.
  •  
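The 2DPCA feature-extraction step named in this record can be sketched as follows: treat each EM trace as a 2-D array, form the image covariance matrix, and project every trace onto its top-d eigenvectors. The BP-network classifier that follows in the paper is not shown, and the trace shapes below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of 2DPCA feature extraction on 2-D EM traces.
# G is the image covariance matrix built without flattening the traces;
# each trace is projected onto the d dominant eigenvectors of G.

def twodpca_projection(traces, d):
    mean = traces.mean(axis=0)
    G = sum((A - mean).T @ (A - mean) for A in traces) / len(traces)
    eigvals, eigvecs = np.linalg.eigh(G)   # eigh returns ascending eigenvalues
    W = eigvecs[:, -d:]                    # top-d principal directions
    return [A @ W for A in traces]         # each trace -> (rows, d) feature map

rng = np.random.default_rng(0)
traces = rng.normal(size=(5, 8, 6))        # 5 traces, 8 samples x 6 channels (assumed)
features = twodpca_projection(traces, d=2)
```

Unlike classical PCA, the traces are never vectorized, which keeps the covariance matrix small (6x6 here) and the features 2-D.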
20.
  • Du, G., et al. (författare)
  • NR-MPA : Non-recovery compression based multi-path packet-connected-circuit architecture of convolution neural networks accelerator
  • 2019
  • Ingår i: Proceedings - 2019 IEEE International Conference on Computer Design, ICCD 2019. - : Institute of Electrical and Electronics Engineers (IEEE). - 9781538666487 ; , s. 173-176
  • Konferensbidrag (refereegranskat)abstract
    • Convolution Neural Networks (CNNs) involve massive amounts of data to be calculated and stored. To meet these challenges, parallel hardware accelerators consisting of hundreds of Processing Elements (PEs), arranged as a many-core system-on-chip connected by a Network-on-Chip (NoC), have been proposed, which achieve high throughput by exploiting the parallel PE array. However, most existing accelerators focus on only one aspect, such as the compute structure of the PE or the data movement overhead over the NoC, so the throughput, area, and latency of the accelerator are not fully optimized. In this paper, we propose an efficient general-purpose CNN accelerator addressing both computation, based on a Non-Recovery Compression (NRC) method, and data movement, via a novel Multi-Path Packet-Connected Circuit (MP-PCC). NRC saves computation time by avoiding multiplier operations through shift decoding in the PE and improves power efficiency by saving a large number of data transmissions. MP-PCC, evolved from the Packet-Connected Circuit, supports single and multicast transmission modes at the same time, and changes the multicast (X, Y) routing algorithm to a multicast Y algorithm to improve transmission efficiency. The proposed architecture, implemented on a Xilinx FPGA, achieves 17.7x faster computation speed and 2.2x fewer memory accesses compared with the state-of-the-art method.
  •  
21.
  • Du, Gaoming, et al. (författare)
  • OLITS : An Ohm's Law-like Traffic Splitting Model Based on Congestion Prediction
  • 2016
  • Ingår i: PROCEEDINGS OF THE 2016 DESIGN, AUTOMATION &amp; TEST IN EUROPE CONFERENCE &amp; EXHIBITION (DATE). - Singapore : IEEE conference proceedings. - 9783981537079 ; , s. 1000-1005
  • Konferensbidrag (refereegranskat)abstract
    • Through traffic splitting, multi-path routing in Network-on-Chip (NoC) outperforms single-path routing in terms of load balance and resource utilization. However, uncontrolled traffic splitting may aggravate network congestion and worsen the communication delay. We propose an Ohm's Law-like traffic splitting model aimed at application-specific NoC. We first characterize the flow congestion by redefining a contention matrix, which contains flow parameters such as average flow rate and burstiness. We then define flow resistance as the flow congestion factor extracted from the contention matrix, and use parallel-resistance theory to predict the congestion state of every target sub-flow. Finally, the traffic splitting proportions of the parallel sub-flows are assigned according to the equivalent flow resistance. Experiments are conducted on both 2D and 3D multi-path routing NoCs. The results show that the worst-case delay bound of the target flow is significantly improved, and network congestion can be effectively balanced.
  •  
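The Ohm's-Law analogy in this record reduces to a simple rule: treating each parallel sub-path's congestion factor as a "flow resistance" R_i, traffic splits like current through parallel resistors, i.e., proportionally to the conductance 1/R_i. A minimal sketch, with illustrative resistance values (in the paper they are extracted from the contention matrix):

```python
# Hedged sketch of Ohm's-Law-like traffic splitting: sub-flow i over a path
# with flow resistance R_i receives a share proportional to 1/R_i, exactly
# like branch currents through parallel resistors.

def split_proportions(resistances):
    conductances = [1.0 / r for r in resistances]
    total = sum(conductances)
    return [g / total for g in conductances]

# least congested (lowest-resistance) path carries the largest share
props = split_proportions([2.0, 4.0, 4.0])
```

With resistances 2, 4, 4 the shares come out 1/2, 1/4, 1/4, mirroring how a 2-ohm branch carries twice the current of each 4-ohm branch.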
22.
  • Du, Gaoming, et al. (författare)
  • SSS : Self-aware System-on-chip Using a Static-dynamic Hybrid Method
  • 2019
  • Ingår i: ACM Journal on Emerging Technologies in Computing Systems. - : ASSOC COMPUTING MACHINERY. - 1550-4832 .- 1550-4840. ; 15:3
  • Tidskriftsartikel (refereegranskat)abstract
    • Network-on-Chip (NoC) has become the de facto communication standard for multi-core or many-core System-on-Chip (SoC) due to its scalability and flexibility. However, an important factor in NoC design is temperature, which affects the overall performance of the SoC: decreasing circuit frequency, increasing energy consumption, and even shortening chip lifetime. In this article, we propose SSS, a self-aware SoC using a static-dynamic hybrid method that combines dynamic mapping and static mapping to reduce the hotspot temperature for NoC-based SoCs. First, we propose monitoring and thermal modeling for self-state sensing. Then, in the static mapping stage, we calculate the optimal mapping solutions under different temperature modes using the discrete firefly algorithm to aid self-decision-making. Finally, in the dynamic mapping stage, we achieve dynamic mapping by configuring NoC and SoC sentient units for self-optimizing. Experimental results show that SSS substantially reduces the peak temperature, by up to 37.52%. The FPGA prototype proves the effectiveness and smartness of SSS in reducing hotspot temperature.
  •  
23.
  • Du, G., et al. (författare)
  • Work-in-progress : SSS: Self-aware system-on-chip using static-dynamic hybrid method
  • 2017
  • Ingår i: Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion, CASES 2017. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450351843
  • Konferensbidrag (refereegranskat)abstract
    • Network-on-chip (NoC) has become the de facto communication standard for multi-core or many-core system-on-chip (SoC), due to its scalability and flexibility. However, temperature is an important factor in NoC design, which affects the overall performance of the SoC: decreasing circuit frequency, increasing energy consumption, and even shortening chip lifetime. In this paper, we propose SSS, a self-aware SoC using a static-dynamic hybrid method, which combines dynamic mapping and static mapping to reduce the hotspot temperature for NoC-based SoCs. First, we propose monitoring the thermal distribution for self-state sensing. Then, in the static mapping stage, we calculate the optimal mapping solutions under different temperature modes using a discrete firefly algorithm to aid self-decision-making. Finally, in the dynamic mapping stage, we achieve dynamic mapping by configuring NoC and SoC sentient units for self-optimizing. Experimental results show SSS can reduce the peak temperature by up to 30.64%. An FPGA prototype shows the effectiveness and smartness of SSS in reducing hotspot temperature.
  •  
24.
  •  
25.
  • Feng, Chaochao, et al. (författare)
  • Performance analysis of on-chip bufferless router with multi-ejection ports
  • 2015
  • Ingår i: Proceedings - 2015 IEEE 11th International Conference on ASIC, ASICON 2015. - : IEEE conference proceedings. - 9781479984831
  • Konferensbidrag (refereegranskat)abstract
    • In general, a bufferless NoC router has only one local output port for ejection, which may lead to multiple arriving flits competing for that single output port. In this paper, we propose a reconfigurable bufferless router in which the number of ejection ports can be configured as 2, 3, or 4. Simulation results demonstrate that the average packet latency of the routers with multiple ejection ports is 18%, 10%, 6%, 14%, 9%, and 7% lower on average than that of the router with one ejection port under six synthetic workloads, respectively. For application workloads, the average packet latency of the routers with more than two ejection ports is only slightly better than that of the router with one ejection port, a difference that can be neglected. Trading off hardware cost against performance, we conclude that there is no need to implement bufferless routers with 3 or 4 ejection ports, as the router with 2 ejection ports can achieve almost the same performance as the routers with 3 and 4 ejection ports.
  •  
26.
  • Fu, Yuxiang, et al. (författare)
  • Congestion-Aware Dynamic Elevator Assignment for Partially Connected 3D-NoCs
  • 2019
  • Ingår i: 2019 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). - : IEEE. - 9781728103976
  • Konferensbidrag (refereegranskat)abstract
    • 3D NoCs, the combination of Networks-on-Chip (NoCs) and 3D IC technology, have been proven to achieve great improvements in both network performance and power consumption compared to 2D NoCs. In a traditional 3D NoC, all routers are vertically connected. Due to the large overhead of Through-Silicon Vias (TSVs), e.g., low fabrication yield and occupied silicon area, partially connected 3D NoCs have emerged. The assignment method determines the traffic loads of the vertical links (elevators) and thus has a great impact on 3D NoC performance. In this paper, we propose a congestion-aware dynamic elevator assignment (CDA) scheme, which takes both distance factors and network congestion information into account. Experiments show that the performance of the proposed CDA scheme is improved by 67% to 87% compared to the random selection scheme, 8% to 25% compared to SelByDis-1, and 13% to 18% compared to SelByDis-2.
  •  
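The selection step this record describes can be sketched as a cost minimization over the available elevators, combining detour distance with observed congestion. The cost weights and the grid/congestion representation below are illustrative assumptions, not the CDA scheme's exact formula.

```python
# Hedged sketch of congestion-aware elevator selection in a partially
# connected 3D NoC layer: pick the vertical link minimizing a weighted sum
# of Manhattan detour distance and its congestion level.

def elevator_cost(src, dst, elevator, congestion, w_dist=1.0, w_cong=2.0):
    detour = (abs(src[0] - elevator[0]) + abs(src[1] - elevator[1]) +
              abs(elevator[0] - dst[0]) + abs(elevator[1] - dst[1]))
    return w_dist * detour + w_cong * congestion[elevator]

def pick_elevator(src, dst, elevators, congestion):
    return min(elevators, key=lambda e: elevator_cost(src, dst, e, congestion))

congestion = {(0, 0): 5, (3, 3): 0}   # assumed congestion readings per elevator
choice = pick_elevator((1, 1), (2, 2), [(0, 0), (3, 3)], congestion)
```

With equal detours (6 hops each), the congested elevator at (0, 0) loses to the idle one at (3, 3); a purely distance-based selector could not tell them apart.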
27.
  • Guo, Shize, et al. (författare)
  • Security-Aware Task Mapping Reducing Thermal Side Channel Leakage in CMPs
  • 2019
  • Ingår i: IEEE Transactions on Industrial Informatics. - : IEEE. - 1551-3203 .- 1941-0050. ; 15:10, s. 5435-5443
  • Tidskriftsartikel (refereegranskat)abstract
    • Chip multiprocessor (CMP) suffers from growing threats on hardware security in recent years, such as side channel attack, hardware Trojan infection, chip clone, etc. In this paper, we propose a security-aware (SA) task mapping method to reduce the information leakage from the CMP thermal side channel. First, we construct a mathematical function that can estimate the CMP security cost corresponding to a given mapping result. Then, we develop a greedy mapping algorithm that automatically allocates all threads of an application to a set of proper cores, such that the total security cost is optimized. Finally, we perform extensive experiments to evaluate our method. The experimental results show that our SA mapping effectively decreases the CMP side channel leakage. Compared to the two existing task mapping methods, Linux scheduler (LS; a standard Linux scheduler) and NoC-Sprinting (NS; a thermal-aware mapping technique), our method reduces the side-channel vulnerability factor by up to 19% and 7%, respectively. Moreover, our method also gains higher computational efficiency, with improvement in million instructions per second achieving up to 100% against NS and up to 33% against LS.
  •  
28.
  •  
29.
  • Huan, Yuxiang, et al. (författare)
  • A 101.4 GOPS/W Reconfigurable and Scalable Control-Centric Embedded Processor for Domain-Specific Applications
  • 2016
  • Ingår i: IEEE Transactions on Circuits and Systems Part 1. - : IEEE. - 1549-8328 .- 1558-0806. ; 63:12, s. 2245-2256
  • Tidskriftsartikel (refereegranskat)abstract
    • Adapting the processor to the target application is essential in the Internet-of-Things (IoT), and thus requires customizability in order to improve energy efficiency and scalability to provide sufficient performance. In this paper, a reconfigurable and scalable control-centric architecture is proposed, and a processor consisting of two cores and an on-chip multi-mode router is implemented. Reconfigurability is enabled by a programmable sequence mapping table (SMT) which reorganizes functional units in each cycle, thus increasing hardware utilization and reducing excessive data movement for high energy efficiency. The router facilitates both wormhole and circuit switching to construct intra- or inter-chip interconnections, providing scalable performance. Fabricated in a 65-nm process, the chip exhibits 101.4 GOPS/W energy efficiency with a die size of 3.5 mm(2). The processor carries out general-purpose processing with a code size 29% smaller than the ARM Cortex M4, and improves the performance of application-specific processing by over ten times when implementing AES and RSA using SMTs instead of general-purpose C. By utilizing the on-chip router, the processor can be interconnected up to 256 nodes, with a single link bandwidth of 1.4 Gbps.
  •  
30.
  • Jafari, Fahimeh, et al. (författare)
  • Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels
  • 2015
  • Ingår i: ACM Transactions on Design Automation of Electronic Systems. - : Association for Computing Machinery (ACM). - 1084-4309 .- 1557-7309. ; 20:3
  • Tidskriftsartikel (refereegranskat)abstract
    • Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO-multiplexed on-chip packet-switched networks, we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate scheduling, which schedules multiple flows as an aggregate flow. VBR flows are characterized by a maximum transfer size (L), peak rate (p), burstiness (sigma), and average sustainable rate (rho). Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic case study we find that the end-to-end delay bound is up to 46.9% more accurate than in the case without considering the traffic peak behavior. Likewise, results also show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation for large and complex SoCs.
  •  
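The flow parameters named in this record (L, p, sigma, rho) can be made concrete with the standard single-node network-calculus bound: a VBR flow with arrival curve alpha(t) = min(L + p·t, sigma + rho·t) crossing a rate-latency server beta(t) = R·(t - T) has delay bound T + (L + theta·(p - R))/R with theta = (sigma - L)/(p - rho) when rho <= R < p, and T + L/R when R >= p. This is a textbook result sketched here for illustration; the paper's contribution, deriving per-flow equivalent service curves under FIFO aggregate scheduling, is not reproduced.

```python
# Hedged worked example: worst-case delay bound of one VBR flow through a
# single rate-latency server (R, T), computed as the horizontal deviation
# between arrival curve min(L + p*t, sigma + rho*t) and service curve R*(t-T).

def vbr_delay_bound(L, p, sigma, rho, R, T):
    assert R >= rho, "server must at least sustain the average rate"
    if R >= p:
        return T + L / R                     # burst L dominates
    theta = (sigma - L) / (p - rho)          # end of the peak-rate segment
    return T + (L + theta * (p - R)) / R     # worst case at the breakpoint

# illustrative numbers: peak rate 4, sustained rate 1, server rate 2, latency 0.5
bound = vbr_delay_bound(L=1.0, p=4.0, sigma=3.0, rho=1.0, R=2.0, T=0.5)
```

Ignoring the peak-rate segment (i.e., using only sigma + rho·t) would give a looser bound, which is the "traffic peak behavior" accuracy gain the abstract quantifies.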
31.
  • Jafari, Fahimeh, et al. (författare)
  • Weighted Round Robin Configuration for Worst-Case Delay Optimization in Network-on-Chip
  • 2016
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : IEEE. - 1063-8210 .- 1557-9999. ; 24:12, s. 3387-3400
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose an approach for computing the end-to-end delay bound of individual variable bit-rate flows in a First-In First-Out (FIFO) multiplexer with aggregate scheduling under a weighted round robin (WRR) policy. To this end, we use network calculus to derive per-flow end-to-end equivalent service curves employed for computing least upper delay bounds (LUDBs) of the individual flows. Since real-time applications require guaranteed services with low delay bounds, we optimize the weights in the WRR policy to minimize the LUDBs while satisfying the performance constraints. We formulate two constrained delay optimization problems, namely minimize-delay and multiobjective optimization. Multiobjective optimization has both the total delay bounds and their variance as the minimization objectives. The proposed optimizations are solved using a genetic algorithm. A video object plane decoder case study exhibits a 15.4% reduction of the total worst-case delays and a 40.3% reduction of the variance of delays when compared with the round robin policy. The optimization algorithm has low run-time complexity, enabling quick exploration of large design spaces. We conclude that an appropriate weight allocation can be a valuable instrument for delay optimization in on-chip network designs.
  •  
32.
  • Li, Cunlu, et al. (författare)
  • RoB-Router : A Reorder Buffer Enabled Low Latency Network-on-Chip Router
  • 2018
  • Ingår i: IEEE Transactions on Parallel and Distributed Systems. - : IEEE COMPUTER SOC. - 1045-9219 .- 1558-2183. ; 29:9, s. 2090-2104
  • Tidskriftsartikel (refereegranskat)abstract
    • Traditional input-queued routers in networks-on-chip (NoCs) have only a small number of virtual channels (VCs), and packets in a VC are organized in a fixed order. Such a design is susceptible to head-of-line (HoL) blocking, as only the packet at the head of a VC can be allocated by the switch allocator. Since switch allocation is the critical pipeline stage in on-chip routers, HoL blocking significantly degrades the performance of NoCs. In this paper, we propose to schedule packets in input buffers utilizing reorder buffer (RoB) techniques. We design VCs as RoBs to allow packets not located at the head of a VC to be allocated before the head packets. RoBs reduce conflicts in switch allocation, mitigate HoL blocking, and thus improve NoC performance. However, it is hard to reorder all the units in a VC due to circuit complexity and power overhead. We propose RoB-Router, which leverages elastic RoBs in VCs to allow only a part of a VC to act as a RoB. RoB-Router automatically determines the length of the RoB in a VC based on the number of buffered flits. This design minimizes the resource cost while achieving high efficiency. Furthermore, we propose two independent methods to improve the performance of RoB-Router. One optimizes the packet order in input buffers by redesigning the VC allocation strategy. The other combines RoB-Router with the currently most efficient switch allocator, TS-Router. Our evaluations show that our design can achieve 46 and 15.7 percent packet latency improvement over TS-Router under synthetic traffic and PARSEC traces, respectively, with moderate energy and area cost. Additionally, the average packet latency reductions by our two improvement methods under uniform traffic are 13 and 17 percent, respectively.
  •  
33.
  • Liu, Shaoteng, et al. (författare)
  • Highway in TDM NoCs
  • 2015
  • Ingår i: Proceedings of the Ninth ACM/IEEE International Symposium on Networks-on-Chip (NoCS'15). - New York, NY, USA : ACM Digital Library. - 9781450333962
  • Konferensbidrag (refereegranskat)abstract
    • TDM (Time Division Multiplexing) is a well-known technique to provide QoS guarantees in NoCs. However, unused time slots commonly exist in TDM NoCs. In this paper, we propose a TDM highway technique which can enhance the slot utilization of TDM NoCs. A TDM highway is an express TDM connection composed of special buffer queues, called highway channels (HWCs). It can enhance the throughput and reduce the data transfer delay of the connection, while keeping the quality-of-service (QoS) guarantee on minimum bandwidth and in-order packet delivery. We have developed a dynamic and repetitive highway setup policy which has no dependency on particular TDM NoC techniques and no overhead on traffic flows. As a result, highways can be efficiently established and utilized in various TDM NoCs. According to our experiments, compared to a traditional TDM NoC, adding one HWC with two buffers to every input port of routers in an 8×8 mesh can reduce data delay by up to 80% and increase the maximum throughput by up to 310%. More improvements can be achieved by adding more HWCs per input per router, or more buffers per HWC. We also use a set of MPSoC application benchmarks to evaluate our highway technique. The experimental results suggest that with highways, we can reduce application run time by up to 51%.
  •  
34.
  • Liu, Shaoteng, et al. (författare)
  • MultiCS : Circuit switched NoC with multiple sub-networks and sub-channels
  • 2015
  • Ingår i: Journal of systems architecture. - : Elsevier. - 1383-7621 .- 1873-6165.
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose a multi-channel and multi-network circuit switched NoC (MultiCS) with a probe searching setup method to explore different channel partitioning and configuration policies. Our design has a variable number of channels which can be configured either as sub-channels (spatial division multiplexing channels) or sub-networks. Packets can be delivered on an established connection with one or multiple channels. An adaptive channel allocation scheme, which determines a connection width according to the dynamic use of channels, can greatly reduce the delay, compared to a deterministic allocation scheme. However, the latter can offer exact connection width as requested. The benefits and burden of using different number of channels and configurations are studied by analysis and experiments. Our experimental results show that sub-network configurations are superior to sub-channel configurations in delay and throughput, when working at the highest clock frequency of each configuration. Under reasonable channel partitioning, sub-networks with narrow channels can generally achieve higher throughput than the network using single wide channels.
  •  
35.
  • Long, Y. -C, et al. (författare)
  • Analysis and Evaluation of Delay Bounds for Multiplexing Models Based on Network Calculus
  • 2018
  • Ingår i: Tien Tzu Hsueh Pao. - : Chinese Institute of Electronics. - 0372-2112. ; 46:8, s. 1815-1821
  • Tidskriftsartikel (refereegranskat)abstract
    • In resource-sharing communication media such as buses, crossbars and networks, multiplexings are inevitable. While sending packets over a multiplexing node, the worst-case delay bound can be computed using network calculus. The tightness of such delay bound remains an open problem. This paper studies different analysis approaches for multiplexing models, from the single multiplexing node to multi-flow-multi-node model, applying two traffic arrival models, and two service properties when getting equivalent service curves. We analyze per-flow delay bounds with different models, then empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.
  •  
36.
  • Long, Yanchen, et al. (författare)
  • Composable Worst-Case Delay Bound Analysis Using Network Calculus
  • 2018
  • Ingår i: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 0278-0070 .- 1937-4151. ; 37:3, s. 705-709
  • Tidskriftsartikel (refereegranskat)abstract
    • Performance analysis is playing an indispensable role in design and evaluation for on-chip networks. In former studies, the end-to-end delay bound is calculated by the equivalent service curve method based on network calculus when resource sharing happens. However, in this paper, we propose a composable method to get the bound. This method uses the aggregated local arrival curve to get the local delay bound first, then calculates the end-to-end bound by summing up local bounds. This method solves the scalability problem and largely decreases the computation complexity compared with the former method.
  •  
37.
  • Lu, Zhonghai, et al. (författare)
  • Aggregate flow-based performance fairness in CMPs
  • 2016
  • Ingår i: ACM Transactions on Architecture and Code Optimization (TACO). - : Association for Computing Machinery (ACM). - 1544-3566 .- 1544-3973. ; 13:4
  • Tidskriftsartikel (refereegranskat)abstract
    • In CMPs, multiple co-executing applications create mutual interference when sharing the underlying network-on-chip architecture. Such interference causes different performance slowdowns to different applications. To mitigate the unfairness problem, we treat traffic initiated from the same thread as an aggregate flow such that causal request/reply packet sequences can be allocated to resources consistently and fairly according to online profiled traffic injection rates. Our solution comprises three coherent mechanisms from rate profiling, rate inheritance, and rate-proportional channel scheduling to facilitate and realize unbiased workload-adaptive resource allocation. Full-system evaluations in GEM5 demonstrate that, compared to classic packet-centric and latest application-prioritization approaches, our approach significantly improves weighted speed-up for all multi-application mixtures and achieves nearly ideal performance fairness.
  •  
38.
  •  
39.
  • Lu, Zhonghai, et al. (författare)
  • Dynamic Traffic Regulation in NoC-Based Systems
  • 2017
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : IEEE Press. - 1063-8210 .- 1557-9999. ; 25:2, s. 556-569
  • Tidskriftsartikel (refereegranskat)abstract
    • In network-on-chip (NoC)-based systems, performance enhancement has primarily focused on the network itself, with little attention paid to controlling traffic injection at the network boundary. This is unsatisfactory because traffic may be over-injected, aggravating congestion and lowering performance. Recently, traffic regulation has been proposed as an orthogonal means for performance improvement. Rather than admitting traffic as soon as possible, traffic regulation may hold back packet injection by admitting packets into the network only when the accumulated traffic volume at any time interval does not exceed a threshold. These regulation techniques are, however, often static, likely causing overregulation and underregulation. We propose dynamic traffic regulation to improve the system performance for NoC-based multi-/many-processor systems-on-chip (MPSoC) and chip multi-/many-core processor (CMP) designs. It can be applied to MPSoCs for intellectual property integration in an open-loop fashion by injecting traffic according to its run-time profiled characteristics. It can also be applied to CMPs in a closed-loop fashion by admitting traffic fully adaptively to the traffic and network states. Through extensive experiments and results, we show that both the open-loop and closed-loop dynamic regulation techniques can significantly improve the network and system performance.
  •  
40.
  •  
41.
  • Lu, Zhonghai, et al. (författare)
  • Message from the Chairs
  • 2018
  • Ingår i: 12th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2018; Torino; Italy; 4 October 2018 through 5 October 2018. - : Institute of Electrical and Electronics Engineers Inc..
  • Konferensbidrag (refereegranskat)
  •  
42.
  • Lu, Zhonghai, et al. (författare)
  • Thread Voting DVFS for Manycore NoCs
  • 2018
  • Ingår i: IEEE Transactions on Computers. - : IEEE Computer Society. - 0018-9340 .- 1557-9956. ; 67:10, s. 1506-1524
  • Tidskriftsartikel (refereegranskat)abstract
    • We present a thread-voting DVFS technique for manycore networks-on-chip (NoCs). This technique has two remarkable features which differentiate from conventional NoC DVFS schemes. (1) Not only network-level but also thread-level runtime performance indicatives are used to guide DVFS decisions. (2) To resolve multiple perhaps conflicting performance indicatives from many cores, it allows each thread to 'vote' for a V/F level in its own performance interest, and a region-based V/F controller makes dynamic per-region V/F decision according to the major vote. We evaluate our technique on a 64-core CMP in full-system simulation environment GEM5 with both PARSEC and SPEC OMP2012 benchmarks. Compared to a network metric (router buffer occupancy) based approach, it can improve the network energy efficacy measured in MPPJ (million packets per joule) by up to 22 percent for PARSEC and 20 percent for SPEC OMP2012, and the system energy efficacy measured in MIPJ (million instructions per joule) by up to 35 percent for PARSEC and 33 percent for SPEC OMP2012. 
  •  
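The voting step this record describes can be sketched in a few lines: each thread in a V/F region votes for the level that suits its own performance indicative, and the region controller applies the majority vote. The level names and the tie-breaking rule (favor the higher level) are assumptions for illustration, not details from the paper.

```python
from collections import Counter

# Hedged sketch of the thread-voting DVFS decision for one V/F region:
# tally per-thread votes and apply the majority; ties break toward the
# higher V/F level to protect performance (assumed policy).

LEVELS = ["low", "mid", "high"]  # ascending V/F levels (illustrative names)

def region_vf_decision(votes):
    counts = Counter(votes)
    best = max(counts.values())
    return max((lv for lv in LEVELS if counts.get(lv) == best),
               key=LEVELS.index)

decision = region_vf_decision(["high", "low", "high", "mid"])
```

A single network metric such as buffer occupancy would pick one level for everyone; letting threads vote is what lets conflicting per-thread interests be resolved per region.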
43.
  • Lu, Zhonghai, et al. (författare)
  • Towards stochastic delay bound analysis for network-on-chip
  • 2015
  • Ingår i: Proceedings - 2014 8th IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2014. - 9781479953479 ; , s. 64-71
  • Konferensbidrag (refereegranskat)abstract
    • We propose stochastic performance analysis in order to provide probabilistic quality-of-service guarantees in on-chip packet-switching networks. In contrast to deterministic analysis which gives per-flow absolute delay bound, stochastic analysis derives per-flow probabilistic delay bounding function, which can be used to avoid over-dimensioning network resources. Based on stochastic network calculus, we build a basic analytic model for an on-chip router, propose and exemplify a stochastic performance analysis flow. In experiments, we show the correctness and accuracy of our analysis, and exhibit its potential in enhancing network utilization with a relaxed delay requirement. Moreover, the benefits of such relaxation is demonstrated through a video playback application.
  •  
44.
  • Lu, Zhonghai, et al. (författare)
  • xMAS-Based QoS Analysis Methodology
  • 2018
  • Ingår i: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 0278-0070 .- 1937-4151. ; 37:2, s. 364-377
  • Tidskriftsartikel (refereegranskat)abstract
    • On-chip communication system design starting from a high-level model can facilitate formal verification of system properties, such as safety and deadlock freedom. Yet, analyzing its quality-of-service (QoS) property, in our context, per-flow delay bound, is an open challenge. Based on executable micro-architectural specification (xMAS) which is a formal framework modeling communication fabrics, we first present how to model a classic input-queuing virtual channel router using the xMAS primitives and then a QoS analysis methodology using network calculus (NC). Thanks to the precise semantics of the xMAS primitives, the router can be modeled in different variants, which cannot be otherwise captured by normal ad hoc box diagrams. The analysis methodology consists of three steps: 1) given network and flow knowledge, we first create a well-defined precise xMAS model for a specific application on a concrete on-chip network; 2) the specific xMAS model is then mapped to an NC graph (NCG) following a set of mapping rules; and 3) finally, existing QoS analysis techniques can be applied to analyze the NCG to obtain end-to-end delay bound per flow. We also show how to apply the technique to a typical all-to-one communication pattern on a binary-tree network and conduct an SoC case study, exemplifying the step-by-step analysis procedure and discussing the tightness of the results.
  •  
45.
  • Lv, Hao, et al. (författare)
  • Exploiting Minipage-level Mapping to Improve Write Efficiency of NAND Flash
  • 2018
  • Ingår i: 2018 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE AND STORAGE (NAS). - : Institute of Electrical and Electronics Engineers (IEEE).
  • Konferensbidrag (refereegranskat)abstract
    • Pushing NAND flash memory to higher density, manufacturers are aggressively enlarging the flash page size. However, the sizes of I/O requests in a wide range of scenarios do not grow accordingly. Since a page is the unit of flash read/write operations, traditional flash translation layers (FTLs) maintain the page mapping regularity. Hence, small random write requests become common, leading to extensive partial logical page writes. This write inefficiency significantly degrades the performance and increases the write amplification of flash storage. In this paper, we first propose a configurable mapping layer, called minipage, whose size is set to match I/O request sizes. The minipage-level mapping provides better flexibility in handling small writes at the cost of sequential read performance degradation and a larger mapping table. Then, we propose a new FTL, called PM-FTL, that exploits the minipage-level mapping to improve write efficiency and utilizes the page-level mapping to reduce the costs caused by the minipage-level mapping. Finally, trace-driven simulation results show that compared to traditional FTLs, PM-FTL reduces the write amplification and flash storage response time by an average of 33.4% and 19.1%, up to 57.7% and 34%, respectively, under 16KB flash pages and 4KB minipages.
  •  
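The minipage idea in this record is essentially a finer-grained address map: a flash page is divided into fixed-size minipages sized to match small I/O requests, and the FTL keeps one mapping entry per minipage instead of per page. The 16KB page / 4KB minipage sizes below come from the abstract's experiments; the address-split itself is an illustrative sketch, not PM-FTL's actual table layout.

```python
# Hedged sketch of minipage-level address mapping: split a logical byte
# address into a logical page number and a minipage index within the page,
# so a 4KB random write updates one minipage entry instead of a 16KB page.

PAGE_SIZE = 16 * 1024
MINIPAGE_SIZE = 4 * 1024
MINIPAGES_PER_PAGE = PAGE_SIZE // MINIPAGE_SIZE  # 4 minipages per page

def to_minipage(logical_byte_addr):
    """Map a logical byte address to (logical page number, minipage index)."""
    lpn = logical_byte_addr // PAGE_SIZE
    mpn = (logical_byte_addr % PAGE_SIZE) // MINIPAGE_SIZE
    return lpn, mpn

entry = to_minipage(20480)  # 20KB into the address space
```

Under a page-level map, every sub-page write forces a read-modify-write of the whole 16KB page; the finer index is what removes that partial-page write penalty, at the cost of a 4x larger table.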
46.
  • Ma, Ning, et al. (författare)
  • A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications
  • 2016
  • Ingår i: Proceedings - IEEE International Symposium on Circuits and Systems. - : IEEE. - 9781479953400 ; , s. 1746-1749
  • Konferensbidrag (refereegranskat)abstract
    • Increasing the energy efficiency and performance while providing customizability and scalability is vital for embedded processors adapting to domain-specific applications such as the Internet of Things. In this paper, we propose a reconfigurable and scalable control-centric architecture, and implement a design consisting of two cores and an on-chip multi-mode router in 65 nm technology. The reconfigurability is enabled by the restructurable sequence mapping table (SMT) and thus the reorganizable functional units. Owing to the integration of the multi-mode router, an on-chip or inter-chip network for multi-/many-core computing can be composed for performance extension on demand, even in the post-fabrication stage. The control-centric design simplifies the control logic, shrinks the non-functional units, and orchestrates the operations to increase the hardware utilization and reduce excessive data movement for high energy efficiency. As a result, the processor can both conduct general-purpose processing with 29% smaller code size and application-specific processing with over 10 times performance improvement when implementing AES by SMT. The dual-core processor consumes 19.7 μW/MHz with a die size of 3.5 mm2. The achieved energy efficiency is 101.4 GOPS/W.
  •  
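The sequence mapping table idea above can be sketched minimally: an opcode indexes a table of control-word sequences, and rewriting the table reconfigures what the functional units do, even after fabrication. All table contents and names below are illustrative assumptions, not the paper's actual SMT encoding.

```python
# Hypothetical SMT: each opcode maps to the control-word sequence the
# datapath steps through. Entries here are invented for illustration.
smt = {
    "ADD": ["fetch_ops", "alu_add", "writeback"],
    "AES_ROUND": ["sub_bytes", "shift_rows", "mix_columns", "add_round_key"],
}

def execute(opcode):
    """Return the control-word sequence for an opcode."""
    return smt[opcode]

# Post-fabrication reconfiguration: install a new application-specific
# instruction simply by writing a new sequence into the table.
smt["MULADD"] = ["fetch_ops", "alu_mul", "alu_add", "writeback"]

print(execute("AES_ROUND"))
```

This is how one table mechanism can serve both a general-purpose ISA (short sequences like `ADD`) and fused application-specific instructions (like the `AES_ROUND` entry) without extra decode logic.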
47.
  •  
48.
  • Ma, Ning, et al. (author)
  • Design and Implementation of Multi-mode Routers for Large-scale Inter-core Networks
  • 2016
  • In: Integration. - : Elsevier. - 0167-9260 .- 1872-7522. ; 53, pp. 1-13
  • Journal article (other academic/artistic) abstract
    • Constructing on-chip or inter-silicon (inter-die/inter-chip) networks to connect multiple processors extends system capability and scalability. A key issue is implementing a flexible router that can fit various application scenarios. This paper proposes a multi-mode adaptable router that supports both circuit and wormhole switching while supplying flexible working strategies for specific traffic patterns in diverse applications. The limitation of mono-mode switched routers is shown first, followed by algorithm exploration in the proposed router for choosing the proper working strategy in a specific network. We then present the performance improvement obtained when applying the mixed circuit/wormhole switching mode to different applications, and analyze image decoding as a case study. The multi-mode router has been implemented with different configurations in a 65 nm CMOS technology. The configuration with 8-bit flit width is demonstrated together with a multi-core processor to show feasibility. Working at 350 MHz, the average power consumption of the whole system is 22 mW.
  •  
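The working-strategy choice described above can be caricatured as a rule of thumb: long, steady flows amortize the cost of circuit setup, while short bursty traffic favors wormhole switching. The function and thresholds below are our illustrative assumptions, not the paper's selection algorithm.

```python
def choose_mode(avg_burst_flits, setup_latency_ok):
    """Pick a switching mode per flow from a coarse traffic profile.

    avg_burst_flits: mean number of flits sent per burst on this flow.
    setup_latency_ok: whether the flow tolerates circuit-setup latency.
    The 64-flit threshold is an invented example value.
    """
    if avg_burst_flits > 64 and setup_latency_ok:
        return "circuit"    # steady stream: reserve the path once
    return "wormhole"       # bursty/short: share links flit by flit

print(choose_mode(128, True))   # streaming video -> "circuit"
print(choose_mode(8, True))     # control messages -> "wormhole"
```

A multi-mode router lets this decision be made per flow at runtime rather than fixed at design time, which is what enables one router design to serve the diverse traffic patterns the paper targets.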
49.
  • Ma, Ning, et al. (author)
  • Implementing MVC Decoding on Homogeneous NoCs : Circuit Switching or Wormhole Switching
  • 2015
  • Conference paper (peer-reviewed) abstract
    • To implement multiview video decoding on network-on-chip (NoC) based homogeneous multicore architectures, the selection of a switching technique for the routers is one of the most important aspects of design space exploration. Circuit switching and wormhole switching are the two most feasible switching techniques for on-chip networks. To choose the suitable switching technique, we compare circuit switching and wormhole switching in terms of whole-system decoding speed, link utilization, and delay when implementing eight-view QVGA video decoding on 4 × 4 NoCs at 30 fps. The required link bandwidths are both around 800 Mbps, with similar network utilization and delay. We conclude that, to implement multiview video decoding on homogeneous NoCs, circuit switching is more suitable, given its similar performance and lower cost compared with wormhole switching.
  •  
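As a back-of-envelope check of the quoted figures (our arithmetic, not from the paper): the raw decoded pixel bandwidth of eight QVGA views at 30 fps in YUV 4:2:0 is only about 221 Mbps, so the reported ~800 Mbps link bandwidth plausibly reflects additional NoC traffic such as reference-frame accesses and inter-view prediction data on top of the final pixels.

```python
# Raw decoded-pixel bandwidth for the workload in the abstract.
views, width, height, fps = 8, 320, 240, 30   # eight QVGA views at 30 fps
bytes_per_pixel = 1.5                          # YUV 4:2:0 (assumed format)

mbps = views * width * height * bytes_per_pixel * fps * 8 / 1e6
print(round(mbps, 1))  # 221.2 Mbps of final pixels alone
```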
50.
  • Ma, Ning (author)
  • Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing
  • 2015
  • Doctoral thesis (other academic/artistic) abstract
    • The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling caused by the difficulties of further lowering the supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency of hardware resources must be improved to maximize performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency. To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for the Internet of Things (IoT), are implemented in the proposed ASIPs. Multimedia stream processing for human-computer interaction is always characterized by high data throughput. To design processors for networked multimedia streams, customized application-specific accelerators controlled by the embedded processor are exploited. By abstracting the common features of multiple coding algorithms, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 μm CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW. When even higher throughput is required, such as in multiview video coding applications, multiple customized processors are connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of the communication network, as well as the task assignment schemes. In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded by a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core architecture uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions, and dedicated implementations. Reducing global data transfer also increases the energy efficiency. Finally, a single tile processor, together with a network of bare dies and a network of packaged chips, has been demonstrated. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.
  •  
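Combining the reported figures gives a quick sanity check (our arithmetic, assuming the 19.7 μW/MHz splits evenly between the two cores): 101.4 GOPS/W per core at that power density works out to roughly one operation per clock cycle per core.

```python
# Cross-check of the reported efficiency and power-density figures.
power_per_mhz_per_core = 19.7e-6 / 2   # W per MHz, assuming an even split
ops_per_joule = 101.4e9                 # 101.4 GOPS/W per core

# (ops/J) * (J/s per MHz) = ops/s per MHz; divide by 1e6 Hz to get ops/cycle.
ops_per_cycle = ops_per_joule * power_per_mhz_per_core / 1e6
print(round(ops_per_cycle, 2))  # ≈ 1.0 operation per cycle per core
```

That the two independently reported numbers land almost exactly on one op/cycle suggests the GOPS figure counts single-issue throughput rather than, say, multiply-accumulate pairs.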