SwePub

Result list for search "WFRF:(Lu Zhonghai) srt2:(2020-2024)"


  • Results 1-50 of 54
1.
  • Chen, Hui, et al. (authors)
  • A CORDIC-Based Architecture with Adjustable Precision and Flexible Scalability to Implement Sigmoid and Tanh Functions
  • 2020
  • In: IEEE International Symposium on Circuits and Systems, ISCAS 2020. - : IEEE.
  • Conference paper (peer-reviewed). Abstract:
    • In artificial neural networks, the tanh (hyperbolic tangent) and sigmoid functions are widely used as activation functions. Past methods of computing them may suffer from low precision or from an inflexible architecture that is difficult to expand, so we propose a CORDIC-based architecture to implement the sigmoid and tanh functions with adjustable precision and flexible scalability. It needs only shift-add-or-subtract operations to compute high-accuracy results, and the input range is easy to expand by scaling the negative iterations of CORDIC without changing the original architecture. We adopt the control variable method to explore the accuracy distribution through software simulation. A specific case (ARCH: (1, 15, 18), RMSE: 10^(-6)) is designed and synthesized under the TSMC 40 nm CMOS technology; the report shows an area of 36512.78 μm² and a power of 12.35 mW at a frequency of 1 GHz. The maximum operating frequency can reach 1.5 GHz, which is better than the state-of-the-art methods.
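The shift-and-add idea in this abstract can be illustrated with a small floating-point sketch of rotation-mode hyperbolic CORDIC. It is not the authors' architecture: the function names, the default of 16 iterations, and the use of an ordinary division for sinh/cosh are assumptions made for illustration, and the negative-iteration range extension described in the abstract is not modeled, so the input must stay within roughly |z| < 1.1.

```python
import math

def hyperbolic_schedule(n):
    """Iteration indices 1, 2, 3, 4, 4, 5, ... (4, 13, 40, ... are repeated
    so that the hyperbolic CORDIC iteration converges)."""
    idx, i, repeat = [], 1, 4
    while len(idx) < n:
        idx.append(i)
        if i == repeat and len(idx) < n:
            idx.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return idx

def cordic_tanh(z, n=16):
    """tanh(z) for |z| < ~1.1 using rotation-mode hyperbolic CORDIC."""
    schedule = hyperbolic_schedule(n)
    gain = 1.0
    for i in schedule:
        gain *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x, y = 1.0 / gain, 0.0          # pre-scaled so that x -> cosh, y -> sinh
    for i in schedule:
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i               # a right shift by i bits in hardware
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)      # atanh(2^-i) would come from a small table
    return y / x                    # assumption: plain division stands in for a divider

def cordic_sigmoid(v, n=16):
    """sigmoid(v) = (1 + tanh(v / 2)) / 2, valid here for |v| < ~2.2."""
    return 0.5 * (1.0 + cordic_tanh(v / 2.0, n))
```

Raising `n` trades more iterations for higher precision, which is the kind of adjustable-precision knob the abstract refers to.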
2.
  • Chen, Hui, et al. (authors)
  • A General Methodology and Architecture for Arbitrary Complex Number Nth Root Computation
  • 2021
  • In: 2021 SCAS 2021/IEEE International Symposium on Circuits and Systems. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed). Abstract:
    • As the existing methods for computing the Nth root of a complex number are relatively fragmented, we propose, for the first time, a general method and architecture based on the coordinate rotation digital computer (CORDIC) to compute the Nth root of an arbitrary complex number. Our method computes the complex modulus, the complex phase angle, the real Nth root, and the sine and cosine functions, all of which can be implemented by circular CORDIC, linear CORDIC and hyperbolic CORDIC. Based on these CORDICs, our proposed architecture not only improves hardware efficiency by using only shift-add operations, but also flexibly adjusts the precision and the input range of the complex Nth root. To prove its feasibility, we conduct a software simulation and implement an example circuit in hardware. Synthesized under the TSMC 28 nm CMOS technology, it has an area of 6561 μm² and a power of 3.95 mW at a frequency of 1.5 GHz.
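The five sub-tasks listed in this abstract (modulus, phase angle, real Nth root, sine, cosine) correspond to the polar form of the principal root. The sketch below shows that decomposition only, using Python's math library in place of the CORDIC units; the function name `complex_nth_root` is hypothetical.

```python
import math

def complex_nth_root(re, im, n):
    """Principal Nth root of re + j*im, returned as (real, imag)."""
    modulus = math.hypot(re, im)      # complex modulus
    phase = math.atan2(im, re)        # complex phase angle
    root_mod = modulus ** (1.0 / n)   # real Nth root of the modulus
    # Sine and cosine of the divided phase give the root's components.
    return root_mod * math.cos(phase / n), root_mod * math.sin(phase / n)
```

For example, the principal cube root of 8j has modulus 2 and phase π/6.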
3.
  • Chen, Hui, et al. (authors)
  • An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions
  • 2020
  • In: Electronics. - : MDPI. - 2079-9292. ; 9:10
  • Journal article (peer-reviewed). Abstract:
    • Efficient and precise hardware implementations of the tanh and sigmoid functions play an important role in various neural network algorithms. Different applications have different accuracy requirements, but it is difficult for traditional methods to achieve adjustable precision. Therefore, we propose, for the first time, a hardware-efficient, adjustable-precision and high-speed architecture to implement them. First, we present two methods to implement the sigmoid and tanh functions. One is based on the rotation mode of hyperbolic CORDIC and the vector mode of linear CORDIC (called RHC-VLC); the other is based on the carry-save method and the vector mode of linear CORDIC (called CSM-VLC). We validate the two methods with MATLAB and RTL implementations. Synthesized under the TSMC 40 nm CMOS technology, a special case AR divide VR(3,0), based on the RHC-VLC method, has an area of 4290.98 μm² and a power of 1.69 mW at a frequency of 1.5 GHz. At the same frequency, AR divide VC(3) (a special case based on the CSM-VLC method) costs 3196.36 μm² of area and 1.38 mW of power. Both are superior to existing methods for implementing such an architecture with adjustable precision.
4.
  • Chen, H., et al. (authors)
  • Huicore : A Generalized Hardware Accelerator for Complicated Functions
  • 2022
  • In: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 69:6, pp. 2463-2476
  • Journal article (peer-reviewed). Abstract:
    • Emerging advanced System-on-Chip (SoC) designs contain more and more complicated functions to be accelerated. This presents a challenge to conventional design approaches which use different hardware architectures or separate hardware accelerators to implement the various functions. To tackle this challenge, for the first time, we propose a generalized hardware accelerator called 'Huicore' to speed up diverse functions on the same substrate. Through the analysis and transformation of mathematical characteristics, we reveal the commonality of many complicated functions using the CORDIC algorithm. Then we explore a reconfigurable architecture to implement them. The proposed reconfigurable accelerator can not only accelerate the implementation of many complicated functions, but also has small area, low power consumption and high precision. It is very suitable for integration in a SoC system to accelerate the implementation of various applications.
5.
  • Chen, Hui, et al. (authors)
  • Hyperbolic CORDIC-Based Architecture for Computing Logarithm and Its Implementation
  • 2020
  • In: IEEE Transactions on Circuits and Systems - II - Express Briefs. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-7747 .- 1558-3791. ; 67:11, pp. 2652-2656
  • Journal article (peer-reviewed). Abstract:
    • We present a CORDIC (Coordinate Rotation Digital Computer)-based method to compute the base-2 logarithm and validate this method by software simulation and hardware implementation. Technically, we overcome the limitation of traditional hyperbolic CORDIC and transform it based on the idea of generalized hyperbolic CORDIC so that it can be used to compute $\log_{2} x\ (x \in [1,2))$. The proposed method requires only simple shift-and-add operations and offers a good trade-off between precision (or speed) and area. In MATLAB, we provide different precisions corresponding to the number of iterations of the transformed CORDIC, to suit user needs. Using a pipelined structure and setting the number of iterations to 16 (the average relative error is $2.09\times 10^{-6}$), we implement an example hardware circuit. Synthesized under the SMIC 65 nm CMOS technology, the circuit has an area of 24100 $\mu m^{2}$ and a computation time of 11.1 ns, which saves 31.04% area and improves computation speed by 6.92% on average compared with existing methods.
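As a rough illustration of how a hyperbolic CORDIC can yield a logarithm, the sketch below uses the conventional vectoring mode and the identity ln w = 2·atanh((w − 1)/(w + 1)), then rescales by 1/ln 2. This is an assumption-laden stand-in: the paper's generalized (base-2) hyperbolic CORDIC is not reproduced here, and the function names and the default of 20 iterations are hypothetical.

```python
import math

def hyperbolic_schedule(n):
    """Iteration indices 1, 2, 3, 4, 4, 5, ... with 4, 13, 40, ... repeated."""
    idx, i, repeat = [], 1, 4
    while len(idx) < n:
        idx.append(i)
        if i == repeat and len(idx) < n:
            idx.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return idx

def cordic_log2(w, n=20):
    """log2(w) for w in [1, 2) via vectoring-mode hyperbolic CORDIC."""
    if not 1.0 <= w < 2.0:
        raise ValueError("w must lie in [1, 2)")
    x, y, z = w + 1.0, w - 1.0, 0.0
    for i in hyperbolic_schedule(n):
        d = -1.0 if y >= 0 else 1.0   # drive y toward zero
        t = 2.0 ** -i                 # a right shift by i bits in hardware
        x, y = x + d * y * t, y + d * x * t
        z -= d * math.atanh(t)        # z accumulates atanh(y0 / x0)
    return 2.0 * z / math.log(2.0)    # ln w = 2*atanh((w-1)/(w+1)); rescale to base 2
```

Each extra iteration adds roughly one bit of precision, which is why the iteration count can be chosen per accuracy requirement.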
6.
  • Chen, Hui, et al. (authors)
  • Low-Complexity High-Precision Method and Architecture for Computing the Logarithm of Complex Numbers
  • 2021
  • In: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 68:8, pp. 3293-3304
  • Journal article (peer-reviewed). Abstract:
    • This paper proposes a low-complexity method and architecture to compute the logarithm of complex numbers based on coordinate rotation digital computer (CORDIC). Our method takes advantage of the vector mode of circular CORDIC and hyperbolic CORDIC, which only needs shift-add operations in its hardware implementation. Our architecture has lower design complexity and higher performance compared with conventional architectures. Through software simulation, we show that this method can achieve high precision for logarithm computation, reaching a relative error of 10^(-7). Finally, we design and implement an example circuit under TSMC 28 nm CMOS technology. According to the synthesis report, our architecture has smaller area, lower power consumption, higher precision and wider operation range compared with the alternative architectures.
7.
  • Chen, Hui, et al. (authors)
  • Symmetric-Mapping LUT-Based Method and Architecture for Computing X^Y-Like Functions
  • 2021
  • In: IEEE Transactions on Circuits and Systems Part 1. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 1549-8328 .- 1558-0806. ; 68:3, pp. 1231-1244
  • Journal article (peer-reviewed). Abstract:
    • We propose a new method and hardware architecture to compute functions expressed as X^Y (where X and Y are arbitrary floating-point numbers), which can support arbitrary Nth root, exponential and power operations. Because direct computation is complex, it is usually converted to logarithm, multiplication, and antilogarithm operations. Traditional approaches suffer from long latency, large area and high power consumption. To solve this problem, we propose a symmetric-mapping lookup table (SM-LUT) capable of computing log_2 x (x ∈ [1, 2]) and 2^x (x ∈ [0, 1]) simultaneously. It lays the foundation for computing X^Y. To further improve the hardware performance of our architecture, we propose a multi-region address searcher to speed up the calculation of the SM-LUT. In addition, we use an optimized Vedic multiplier to shorten the critical path and improve the efficiency of the multiplication involved in computing X^Y. Under the TSMC 40 nm CMOS technology, we design and synthesize a reference circuit to compute X^Y with a maximum relative error of 10^(-3). The report shows that the reference circuit achieves an area of 14338.50 μm² and a power consumption of 4.59 mW at a frequency of 1 GHz. In comparison with the state-of-the-art work under the same input range and similar precision, it saves on average 78.57% area and 80.42% power consumption for computing the Nth root of R, and 82.89% area and 81.89% power consumption for computing R^N. In addition, our architecture reduces the computation latency by 62.77% on average and achieves an energy efficiency one order of magnitude higher than the others.
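The conversion to logarithm, multiplication and antilogarithm mentioned in this abstract rests on X^Y = 2^(Y·log2 X), with the logarithm argument reduced to [1, 2) and the antilogarithm argument reduced to [0, 1). The sketch below shows only that range reduction; `math.log2` and `2.0 ** f` stand in for the lookup table, and the function name `power_via_log2` is hypothetical.

```python
import math

def power_via_log2(x, y):
    """x**y for x > 0, computed as 2**(y * log2(x)) with range reduction."""
    if x <= 0:
        raise ValueError("x must be positive")
    m, e = math.frexp(x)            # x = m * 2**e with m in [0.5, 1)
    m, e = 2.0 * m, e - 1           # renormalize so that m lies in [1, 2)
    log2_x = e + math.log2(m)       # logarithm lookup would cover m in [1, 2)
    p = y * log2_x                  # the multiplication step
    k = math.floor(p)
    f = p - k                       # fractional part in [0, 1)
    return math.ldexp(2.0 ** f, k)  # antilogarithm lookup covers f; 2**k is an exponent shift
```

An Nth root is the special case y = 1/N, and an integer power is y = N.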
8.
  • Chen, Qinyu, et al. (authors)
  • An Efficient Accelerator for Multiple Convolutions From the Sparsity Perspective
  • 2020
  • In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 1063-8210 .- 1557-9999. ; 28:6, pp. 1540-1544
  • Journal article (peer-reviewed). Abstract:
    • Convolutional neural networks (CNNs) have become one of the most popular approaches in many fields. These networks deliver better performance as they grow deeper and larger. However, their complicated computation and huge storage requirements impede hardware implementation. To address this problem, quantized networks have been proposed. In addition, various convolutional structures are designed to meet the requirements of different applications. For example, compared with the traditional convolutions (CONVs) for image classification, CONVs for image generation are usually composed of traditional CONVs, dilated CONVs, and transposed CONVs, leading to a difficult hardware mapping problem. In this brief, we translate the difficult mapping problem into a sparsity problem and propose an efficient hardware architecture for sparse binary and ternary CNNs by exploiting the sparsity and low bit-width characteristics. To this end, we propose an ineffectual data removing (IDR) mechanism to remove both the regular and irregular sparsity based on dual-channel processing elements (PEs). In addition, a flexible layered load balance (LLB) mechanism is introduced to alleviate the load imbalance. The accelerator is implemented in 65-nm technology with a core size of 2.56 mm². It achieves an energy efficiency of 3.72 TOPS/W at 50.1 mW, which makes it a promising design for embedded devices.
9.
  • Chen, Qinyu, et al. (authors)
  • Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks
  • 2022
  • In: 2022 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS 2022). - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 25-28
  • Conference paper (peer-reviewed). Abstract:
    • The study of specialized accelerators tailored for neural networks has become a promising topic in recent years. Existing neural network accelerators are usually designed for convolutional neural networks (CNNs) or recurrent neural networks (RNNs); however, less attention has been paid to attention mechanisms, an emerging neural network primitive with the ability to identify the relations within input entities. Self-attention-oriented models such as the Transformer have achieved great performance on natural language processing, computer vision and machine translation. However, the self-attention mechanism has intrinsically expensive computational workloads, which increase quadratically with the number of input entities. Therefore, in this work, we propose a software-hardware co-design solution for energy-efficient self-attention inference. A prediction-based approximate self-attention mechanism is introduced to substantially reduce the runtime as well as the power consumption, and then a specialized hardware architecture is designed to further increase the speedup. The design is implemented on a Xilinx XC7Z035 FPGA, and the results show that the energy efficiency is improved by 5.7x with less than 1% accuracy loss.
10.
  • Chen, Yizhi, 1995-, et al. (authors)
  • Accelerating Non-Negative Matrix Factorization on Embedded FPGA with Hybrid Logarithmic Dot-Product Approximation
  • 2022
  • In: Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 239-246
  • Conference paper (peer-reviewed). Abstract:
    • Non-negative matrix factorization (NMF) is an effective method for dimensionality reduction and sparse decomposition. This method has been of great interest to the scientific community in applications including signal processing, data mining, compression, and pattern recognition. However, NMF implies elevated computational costs in terms of performance and energy consumption, which makes it poorly suited to embedded applications. To overcome this limitation, we implement the vector dot-product with hybrid logarithmic approximation as a hardware optimization approach. This technique accelerates floating-point computation, reduces energy consumption, and preserves accuracy. To demonstrate our approach, we employ a design exploration flow using high-level synthesis on an embedded FPGA. Compared with software solutions on an ARM CPU, this hardware implementation accelerates the overall matrix decomposition by 5.597× and reduces energy consumption by 69.323×. Log-approximation NMF combined with KNN (k-nearest neighbors) shows an accuracy decrease of only 2.38% on MNIST compared with KNN applied to the matrix after floating-point NMF. Furthermore, compared with a dedicated floating-point accelerator, the logarithmic approximation approach achieves 3.718× acceleration and 8.345× energy reduction. Compared with the fixed-point approach, our approach has an accuracy degradation of 1.93% on MNIST and an accuracy improvement of 28.2% on the FASHION MNIST data set without prior knowledge of the data range. Thus, our approach has better compatibility with the input data range.
11.
  • Chen, Yizhi, 1995-, et al. (authors)
  • Online Image Sensor Fault Detection for Autonomous Vehicles
  • 2022
  • In: Proceedings. - : Institute of Electrical and Electronics Engineers Inc.. ; , pp. 120-127
  • Conference paper (peer-reviewed). Abstract:
    • Automated driving vehicles have shown great potential in the near-future market owing to their high safety and convenience for drivers and passengers. Because many image sensors are used in autonomous vehicles, their reliability attracts considerable research interest. We propose an online image sensor fault detection method that detects faults by comparing the historical variances of normal pixels and defective pixels. For fault pixels without uncertainty, with a detecting window of more than 30 frames, we get 100% accuracy and 100% recall on realistic continuous traffic pictures from the KITTI data set. We also explore the influence of uncertainty in fault pixel values from 0% to 25% and study different fixed thresholds and a dynamic threshold for the decision. A strict threshold of 0.1 has a high accuracy (99.16%) but a low recall (34.46%) for 15% uncertainty. A loose threshold of 0.3 has a relatively high recall (83.78%) but misclassifies too many normal pixels, with 18.17% accuracy for 15% uncertainty. Our dynamic threshold balances accuracy and recall. It gets 100% accuracy and 58.69% recall for 5% uncertainty, and 78.38% accuracy and 55.39% recall for 15% uncertainty. Based on the detected damaged pixel rate, we develop a health score for evaluating the image sensor system intuitively. It can also help in deciding when to replace cameras.
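The idea of comparing per-pixel historical variances can be sketched as follows. The abstract does not give the exact decision rule, so this sketch assumes one: a pixel is flagged when its variance over the frame window falls below `threshold` times the mean variance of all pixels. The function name, the frame representation (lists of rows of floats) and that rule are all hypothetical.

```python
from statistics import mean, pvariance

def detect_stuck_pixels(frames, threshold=0.1):
    """Return the set of (row, col) positions whose temporal variance is
    suspiciously low compared with the rest of the image (assumed rule)."""
    height, width = len(frames[0]), len(frames[0][0])
    variances = [
        [pvariance([frame[r][c] for frame in frames]) for c in range(width)]
        for r in range(height)
    ]
    reference = mean(v for row in variances for v in row)
    if reference == 0:
        return set()  # static scene: no basis for comparison
    return {
        (r, c)
        for r in range(height)
        for c in range(width)
        if variances[r][c] < threshold * reference
    }
```

A detected fault rate then follows as `len(faulty) / (height * width)`; how the paper maps that rate to its health score is not stated in the abstract.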
12.
  • Cui, L., et al. (authors)
  • A Low Bit-Width LDPC Min-Sum Decoding Scheme for NAND Flash
  • 2022
  • In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 0278-0070 .- 1937-4151. ; 41:6, pp. 1971-1975
  • Journal article (peer-reviewed). Abstract:
    • For NAND flash memory, a well-designed low-density parity-check (LDPC) decoding algorithm can ensure data reliability. When the decoding algorithm is implemented in hardware, it is necessary to achieve an attractive trade-off between implementation complexity and decoding performance. In this paper, a novel low bit-width decoding scheme is introduced. In this scheme, a Quasi-Cyclic LDPC (QC-LDPC) code is used, and the row-layered normalized min-sum algorithm is improved by restricting the amplitude of the minimum and second-minimum values in each check node (CN) update. The simulation shows that our approach achieves a lower UBER (Uncorrectable Bit Error Rate) with a negligible increase in computational complexity, especially with low-precision input log-likelihood ratios (LLRs).
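A normalized min-sum check node update with an amplitude restriction can be sketched as below. The abstract does not specify how the minimum and second-minimum are restricted, so simple clipping to `max_mag` is assumed here; the function name and the default values of `alpha` and `max_mag` are hypothetical.

```python
def check_node_update(llrs, alpha=0.75, max_mag=7.0):
    """Normalized min-sum update for one check node.

    llrs: incoming variable-to-check messages (at least two).
    Returns the outgoing check-to-variable message for each edge.
    """
    mags = [abs(v) for v in llrs]
    order = sorted(range(len(llrs)), key=mags.__getitem__)
    i_min = order[0]
    min1 = min(mags[order[0]], max_mag)   # clipped minimum (assumed restriction)
    min2 = min(mags[order[1]], max_mag)   # clipped second minimum
    sign_prod = 1
    for v in llrs:
        if v < 0:
            sign_prod = -sign_prod
    out = []
    for i, v in enumerate(llrs):
        mag = min2 if i == i_min else min1       # exclude the edge's own magnitude
        sign = sign_prod * (-1 if v < 0 else 1)  # exclude the edge's own sign
        out.append(sign * alpha * mag)
    return out
```

Only two magnitudes and one sign product need to be stored per check node, which is what makes bounding their amplitude attractive for a low bit-width datapath.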
13.
  • Dong, Xiaoyu, et al. (authors)
  • Gait Recognition Based on Modified OVR-CSP Fusion Feature and LSTM
  • 2024
  • In: 2024 7th International Conference on Advanced Algorithms and Control Engineering, ICAACE 2024. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 1551-1554
  • Conference paper (peer-reviewed). Abstract:
    • This paper proposes a gait recognition method based on the modified OVR-CSP fusion feature of plantar pressure and Long Short-Term Memory classification (referred to as the OVR-CSP-LSTM model). Ten subjects performed four types of gait experiments: normal-speed walking, fast walking, slow walking, and imitated stroke-gait walking. The Common Spatial Pattern (CSP) feature extraction method commonly used for EEG is transferred to plantar pressure signals, the OVR-CSP features of the 2-class, 3-class and 4-class cases are spliced, and a Long Short-Term Memory (LSTM) network is adopted for classification. The intra-patient mode and the inter-patient mode of the 10 subjects are modeled and compared, and the recognition performance under different numbers of sensors and different combinations of sensor positions is also studied. The experimental results show that the proposed model performs well in both modes. The method proposed in this article is expected to be applicable to multi-sensor signal processing and classification with spatial characteristics.
14.
  • Gao, Qian, et al. (authors)
  • Dynamic and Traffic-Aware Medium Access Control Mechanisms for Wireless NoC Architectures
  • 2021
  • In: 2021 IEEE International Symposium on Circuits and Systems (ISCAS). - : IEEE.
  • Conference paper (peer-reviewed). Abstract:
    • Wireless NoC (WiNoC) has low latency and simple wiring, which can reduce the energy consumption caused by the metal interconnection in traditional NoC architectures. However, the traditional time-division-based medium access control (MAC) mechanism in WiNoC is not aware of the traffic demands of different wireless interfaces (WIs), resulting in an unreasonable distribution of wireless communication channels and degraded performance. Hence, in order to dynamically allocate wireless channels to the WIs based on their traffic demands, a dynamic and traffic-aware MAC mechanism is required. In this paper, we design a traffic demand predictor for each WI based on its current and historical traffic conditions. According to the predicted demands, we are able to allocate access to wireless channels dynamically and switch between two kinds of time-division-based MAC mechanisms. Simulations under various conditions indicate that the average delay decreases by 30% and 20% on average compared with a traditional MAC mechanism and an existing dynamic time-division-based one, respectively. Moreover, the network with the dynamic and traffic-aware MAC reaches the saturation point at a higher packet injection rate.
15.
  • Guo, Shize, et al. (authors)
  • Securing IoT Space via Hardware Trojan Detection
  • 2020
  • In: IEEE Internet of Things Journal. - : IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. - 2327-4662. ; 7:11, pp. 11115-11122
  • Journal article (peer-reviewed). Abstract:
    • A hardware Trojan (HT) is a malicious modification in the chip circuitry, which may change the chip's function in undesired ways or leak sensitive information once activated. As recently studied, HTs have become one of the main threats to Internet-of-Things (IoT) security, and therefore protecting the IoT against HT attacks attracts growing attention from IoT researchers. In this article, we propose an HT detection technique which makes use of chip temporal thermal information and a self-organizing map (SOM) neural network to automatically separate the Trojan-infected chips from the Trojan-free ones and, at the same time, confirm the Trojan location on the infected chips. The experimental results reveal that our method is effective. Specifically, for the Trust-hub benchmarks, it can detect HTs which increase the power consumption of the original design by only 0.02% and localize the Trojan positions precisely without any error. In addition, we demonstrate the advantages of our method over two existing HT detection methods, namely, the thermal and power map (TPM) and the ring oscillator net (RON), and thoroughly discuss how the thermal image resolution, chip technology, and clustering algorithm affect the Trojan detection results.
16.
  • Hu, X., et al. (authors)
  • A Configurable Hardware Architecture for Runtime Application of Network Calculus
  • 2021
  • In: International journal of parallel programming. - : Springer Nature. - 0885-7458 .- 1573-7640. ; 49:5, pp. 745-760
  • Journal article (peer-reviewed). Abstract:
    • Network Calculus has been a foundational theory for analyzing and ensuring Quality-of-Service (QoS) in a variety of networks including Networks on Chip (NoCs). To fulfill dynamic QoS requirements of applications, runtime application of network calculus is essential. However, the primitive operations in network calculus such as arrival curve, min-plus convolution and min-plus deconvolution are very time consuming when calculated in software because of the large volume and long latency of computation. For the first time, we propose a configurable hardware architecture to enable runtime application of network calculus. It employs a unified pipeline that can be dynamically configured to efficiently calculate the arrival curve, min-plus convolution, and min-plus deconvolution at runtime. We have implemented and synthesized this hardware architecture on a Xilinx FPGA platform to quantify its performance and resource consumption. Furthermore, we have built a prototype NoC system incorporating this hardware for dynamic flow regulation to effectively achieve QoS at runtime. 
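For reference, the two min-plus primitives named in this abstract are (f ⊗ g)(t) = min over 0 ≤ s ≤ t of f(s) + g(t − s) and (f ⊘ g)(t) = max over u ≥ 0 of f(t + u) − g(u). The sketch below evaluates them on equally spaced samples; truncating the deconvolution to the sampled horizon is an assumption of this sketch, and the function names are hypothetical.

```python
def min_plus_convolution(f, g):
    """(f ⊗ g)(t) = min_{0 <= s <= t} f(s) + g(t - s) on sampled curves."""
    n = min(len(f), len(g))
    return [min(f[s] + g[t - s] for s in range(t + 1)) for t in range(n)]

def min_plus_deconvolution(f, g):
    """(f ⊘ g)(t) = max_{u >= 0} f(t + u) - g(u), truncated to the samples."""
    n = min(len(f), len(g))
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]
```

Both loops are quadratic in the number of samples, which illustrates why the abstract calls these operations time consuming in software.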
17.
  • Hu, X., et al. (authors)
  • A Configurable Hardware Architecture for Runtime Application of Network Calculus
  • 2021
  • In: Lecture Notes in Computer Science book series (LNTCS,volume 12639). - Cham : Springer Science and Business Media Deutschland GmbH. ; , pp. 203-216
  • Conference paper (peer-reviewed). Abstract:
    • Network Calculus has been a foundational theory for analyzing and ensuring Quality-of-Service (QoS) in a variety of networks including Networks on Chip (NoCs). To fulfill dynamic QoS requirements of applications, runtime application of network calculus is essential. However, the primitive operations in network calculus such as arrival curve, min-plus convolution and min-plus deconvolution are very time consuming when calculated in software because of the large volume and long latency of computation. For the first time, we propose a configurable hardware architecture to enable runtime application of network calculus. It employs a unified pipeline that can be dynamically configured to efficiently calculate the arrival curve, min-plus convolution, and min-plus deconvolution at runtime. We have implemented and synthesized this hardware architecture on a Xilinx FPGA platform to quantify its performance and resource consumption. Furthermore, we have built a prototype NoC system incorporating this hardware for dynamic flow regulation to effectively achieve QoS at runtime. 
18.
  • Hu, X., et al. (authors)
  • End-to-End System QoS Modeling based on Network Calculus : A Multi-Media Case Study
  • 2020
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : Association for Computing Machinery. ; , pp. 80-83
  • Conference paper (peer-reviewed). Abstract:
    • Network Calculus has been used for formal modeling and analysis of timing properties, i.e., Quality-of-Service (QoS), in real-time embedded systems. Prior analyses often focus only on a particular aspect of an entire system, such as the scheduling algorithm or traffic shaping, or on a hardware subsystem such as the network-on-chip, leaving end-to-end system QoS analysis seldom addressed. Based on a video playback system, we conduct end-to-end system-level QoS modeling using network calculus. We build an abstract end-to-end service model for various software routines and hardware modules dealing with both computation and communication. In FPGA prototype experiments running real video clips, we show that the parameters in the service model can be measured and that QoS-related operational details can be monitored for continuous assessment of QoS fulfillment at runtime.
19.
  • Hu, Yuping, et al. (authors)
  • LM-SVM-DT Based Working State Recognition for Washing Machine's Audio Signal
  • 2022
  • In: 2022 IEEE International Conference on Artificial Intelligence and Computer Applications, ICAICA 2022. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 550-554
  • Conference paper (peer-reviewed). Abstract:
    • In order to make a reasonable and effective judgment on the quality inspection and fault diagnosis of intelligent electrical appliances, this paper proposes a method of working state recognition for washing machine based on audio signal, named as LM-SVM-DT. The whole working process of the washing machine is divided into four basic states: water intake, soaking, washing and dehydration. The Log-mel features of the audio signal after bandpass filtering are extracted and modeled by the decision tree classification method based on SVM. That is, soaking and non-soaking states are separated at first, then washing and non-washing states are separated in non-soaking states, and finally water intake and dehydration states are separated in non-washing states. Taking the standard-washing-mode of a certain type of washing machine as an example to verify the algorithm, the experimental results show that the state recognition rate is as high as 0.9920. The results show that the model proposed in this paper is effective and feasible.
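The three-stage separation described in this abstract forms a binary decision tree over four states. The sketch below shows only that control flow: the three predicates are hypothetical callables standing in for the trained SVMs, and the feature vector is whatever those predicates accept (log-mel features in the paper).

```python
def classify_state(features, is_soaking, is_washing, is_water_intake):
    """Cascade of binary decisions: soaking, then washing, then
    water intake versus dehydration."""
    if is_soaking(features):
        return "soaking"
    if is_washing(features):
        return "washing"
    return "water intake" if is_water_intake(features) else "dehydration"
```

With this ordering, each binary classifier only has to separate one state from the states that remain after the earlier stages.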
20.
  • Liu, Qingshan, et al. (authors)
  • ECG abnormality detection Based on Multi-domain combination features and LSTM
  • 2023
  • In: 2023 4th International Conference on Computer Engineering and Application, ICCEA 2023. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 565-569
  • Conference paper (peer-reviewed). Abstract:
    • Most scholars use fixed-length samples to identify ECG abnormalities based on the MIT-BIH dataset, which leads to information loss. To address this problem, this paper proposes a method for ECG abnormality detection based on the TSH-L method. The TSH-L method includes: (1) using the 3R ECG sample selection method to select ECG samples; (2) extracting multi-domain combination features including time-domain features, frequency-domain features and time-frequency-domain features; (3) using LSTM for classification, training and testing the algorithm on the MIT-BIH dataset, and obtaining relatively optimal features as spliced normalized fusion features including kurtosis, skewness and RR interval time-domain features, STFT-based sub-band spectrum features, and harmonic ratio features. Experiments show that the TSH-L method proposed in the paper has a high accuracy of 97.74% for the detection of ECG abnormalities on the MIT-BIH dataset. The 3R-TSH-L method proposed in this paper is expected to be widely used in family-oriented healthcare.
21.
  • Liu, Qingshan, et al. (authors)
  • Health warning based on 3R ECG Sample's combined features and LSTM
  • 2023
  • In: Computers in Biology and Medicine. - : Elsevier BV. - 0010-4825 .- 1879-0534. ; 162
  • Journal article (peer-reviewed). Abstract:
    • Most studies use fixed-length samples to identify ECG abnormalities based on the MIT ECG dataset, which leads to information loss. To address this problem, this paper proposes a method for ECG abnormality detection and health warning based on the ECG Holter of PHIA and the 3R-TSH-L method. The 3R-TSH-L method is implemented by: (1) getting 3R ECG samples using the Pan-Tompkins method and using volatility to obtain high-quality raw ECG data; (2) extracting combination features including time-domain features, frequency-domain features and time-frequency-domain features; (3) using LSTM for classification, training and testing the algorithm on the MIT-BIH dataset, and obtaining relatively optimal features as spliced normalized fusion features including kurtosis, skewness and RR interval time-domain features, STFT-based sub-band spectrum features, and harmonic ratio features. The ECG data were collected using the self-developed ECG Holter (PHIA) on 14 subjects, aged between 24 and 75 and including both males and females, to build the ECG dataset (ECG-H). The algorithm was transferred to the ECG-H dataset, and a health warning assessment model based on abnormal ECG rate and heart rate variability weighting was proposed. Experiments show that the 3R-TSH-L method proposed in the paper has a high accuracy of 98.28% for the detection of ECG abnormalities on the MIT-BIH dataset and a good transfer learning ability, with 95.66% accuracy on ECG-H. The health warning model was also verified to be reasonable. The key technique of the ECG Holter of PHIA and the 3R-TSH-L method proposed in this paper are expected to be widely used in family-oriented healthcare.
22.
  • Liu, W, et al. (authors)
  • DEPS : Exploiting a Dynamic Error Prechecking Scheme to Improve the Read Performance of SSD
  • 2021
  • In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : Institute of Electrical and Electronics Engineers Inc.. - 0278-0070 .- 1937-4151. ; 40:1, pp. 66-77
  • Journal article (peer-reviewed). Abstract:
    • 3D NAND flash memory is increasingly used in solid state drives (SSD), leading to increasing storage capacity. However, the read performance of SSD is sacrificed for decoding operations which are executed to guarantee data reliability. Regardless of whether the data have bit errors, they are sent to the error correcting code (ECC) engine for decoding, introducing a high read delay in the SSD. Error prechecking can help to avoid the redundant decoding operations for the error-free data, but it induces extra checking overhead for the erroneous data. Motivated by this, we carry out comprehensive experiments to analyze the distribution of bit errors in 3D NAND flash memory. The preliminary experimental results show that there are a large number of pages read without errors in the early lifetime of 3D NAND flash memory. Based on the observations and analyses, we propose a model to estimate the error-free ratio, and utilize it to design a dynamic error prechecking scheme (DEPS) to bypass the decoding operation for the error-free data in 3D NAND flash memory and improve the read performance of SSD. Furthermore, by dividing a large page into small subpages, DEPS releases more error-free data, which significantly improves the read performance of SSD. Evaluation results from real-world traces demonstrate that by implementing DEPS, the average read performance of SSD is enhanced by 35% to 55% with 3D MLC NAND flash memory.
  •  
23.
  • Liu, Weihua, et al. (author)
  • Modeling of Threshold Voltage Distribution in 3D NAND Flash Memory
  • 2021
  • In: PROCEEDINGS OF THE 2021 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2021). - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 1729-1732
  • Conference paper (peer-reviewed), abstract:
    • 3D NAND flash memory faces far more complicated interference than planar NAND flash memory, resulting in more concern regarding reliability and performance. Stronger error correction code (ECC) and adaptive reading strategies are proposed to improve the reliability and performance, taking a threshold voltage (Vth) distribution model as the backbone. However, the existing modeling methods struggle to produce such a Vth distribution model for 3D NAND flash memory. To facilitate this, in this paper, we propose a machine learning-based modeling method. It employs a neural network that takes advantage of the existing modeling methods and fully considers multiple interferences and variations in 3D NAND flash memory. Compared with state-of-the-art models, evaluations demonstrate it is more accurate and efficient for predicting Vth distribution.
  •  
24.
  • Lu, Zhonghai, et al. (author)
  • Age Feature Enhanced Neural Network for RUL Estimation of Power Electronic Devices
  • 2023
  • In: 2023 IEEE International Conference on Prognostics and Health Management, ICPHM 2023. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 38-47
  • Conference paper (peer-reviewed), abstract:
    • As in other deep learning problems, critical features are essential for effective estimation of Remaining Useful Lifetime (RUL) for power electronic devices using Neural Networks (NNs). However, these critical features are often indirectly obtained after data pre-processing, complicated either in form (high dimension) or in computation (computation-intensive pre-processing). In the paper, we suggest adding a simple direct feature, age, into the NN based RUL estimation technique. The rationale for incorporating this feature is that the device lifetime is the sum of past time (age) and RUL. Thus it has a strong correlation to RUL. In our experiments using accelerated aging tests, we show that the new age feature enhanced Recurrent Neural Network (RNN) model can significantly improve estimation accuracy and shorten training convergence time. It also outperforms a state-of-the-art RNN model using derived time-domain statistical features.
  •  
25.
  • Lu, Zhonghai, et al. (author)
  • Computational Network-on-Chip as Convolution Engine
  • 2024
  • In: 2024 International VLSI Symposium on Technology, Systems and Applications, VLSI TSA 2024 - Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • Inspired by PiN, Processing in Network-on-Chip (NoC), we propose a computational NoC as a convolution engine for accelerating convolutional neural networks in hardware. In contrast to traditional compute architectures where computation and communication are conducted serially and separately, our computational NoC enables in-transit computation, meaning that computation is performed while packets are transferred in the network. In the paper, we present the router architecture that supports the novel in-transit computation concept, and use a running example to detail the entire convolution process in the computational NoC. Finally, we show simulated performance results in comparison with a traditional NoC-based convolution engine.
  •  
26.
  •  
27.
  • Lu, Zhonghai (author)
  • PiN : Processing in Network-on-Chip
  • 2023
  • In: IEEE design & test. - : Institute of Electrical and Electronics Engineers (IEEE). - 2168-2356 .- 2168-2364. ; 40:6, pp. 30-38
  • Journal article (peer-reviewed), abstract:
    • Editor's notes: The author in this article advocates for Processing in NoC (PiN) as a means to actively engage a Network-on-Chip (NoC) in computation. The article highlights the benefits of utilizing the communication network for system-level performance enhancement, with a case study demonstrating its advantages over conventional passive NoC approaches. —Mahdi Nikdast, Colorado State University, USA —Miquel Moreto, Barcelona Supercomputing Center, Spain —Masoumeh (Azin) Ebrahimi, KTH Royal Institute of Technology, Sweden —Sujay Deb, IIIT Delhi, India
  •  
28.
  • Lu, Zhonghai, et al. (author)
  • Remaining useful lifetime estimation for discrete power electronic devices using physics-informed neural network
  • 2023
  • In: Scientific Reports. - : Springer Nature. - 2045-2322. ; 13:1
  • Journal article (peer-reviewed), abstract:
    • Estimation of Remaining Useful Lifetime (RUL) of discrete power electronics is important to enable predictive maintenance and ensure system safety. Conventional data-driven approaches using neural networks have been applied to address this challenge. However, because they ignore the physical properties of the target RUL function, neural networks can produce unreasonable RUL estimates, such as estimates that go upwards or end at the wrong point. In the paper, we apply the fundamental principle of Physics-Informed Neural Network (PINN) to enhance Recurrent Neural Network (RNN) based RUL estimation methods. Through formulating proper constraints into the loss function of neural networks, we demonstrate in our experiments with the NASA IGBT dataset that PINN can make the neural networks trained more realistically and thus achieve performance improvements in estimation error and coefficient of determination. Compared to the baseline vanilla RNN, our physics-informed RNN can improve Mean Squared Error (MSE) of out-of-sample estimation on average by 24.7% in training and by 51.3% in testing. Compared to the baseline Long Short Term Memory (LSTM, a variant of RNN), our physics-informed LSTM can improve MSE of out-of-sample estimation on average by 15.3% in training and by 13.9% in testing.
  •  
29.
  • Lu, Zhonghai, et al. (author)
  • RUL Estimation for Power Electronic Devices Based on LESIT Equation
  • 2023
  • In: 2023 PROGNOSTICS AND HEALTH MANAGEMENT CONFERENCE, PHM. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 47-54
  • Conference paper (peer-reviewed), abstract:
    • The LESIT equation is a well-known model to estimate the lifetime of power electronic devices. However, the equation is a static lifetime model that can estimate the total lifetime but not the dynamic lifetime over time and the Remaining Useful Lifetime (RUL), because the equation is not related to time or cycle count. In the paper, we first introduce the concept of dynamic lifetime, i.e., lifetime over time, and include time into the equation to allow it to calculate dynamic lifetime, and then propose a simple equation to conduct the RUL estimation assuming linear damage accumulation. In our experiments using aggregated aging tests, we show that the proposed RUL estimation method can fully capture the general linear decreasing trend of RUL, and in most cases, it gives very accurate estimates, where the deviation depends on the accuracy of the original LESIT estimation.
  •  
30.
  • Lu, Zhonghai, et al. (author)
  • Wearable pressure sensing for lower limb amputees
  • 2022
  • In: BioCAS 2022 - IEEE Biomedical Circuits and Systems Conference. - : Institute of Electrical and Electronics Engineers Inc.. - 9781665469173 ; , pp. 105-109
  • Conference paper (peer-reviewed), abstract:
    • Pressure sensing in prosthetic sockets is valuable as it provides quantified data to assist prosthetists in designing comfortable sockets for amputees. We present a wearable pressure sensing system for lower limb amputees. The full system consists of three essential elements from sensing scheme (wearable sensors, sensor calibration and deployment), electronic measurement system (embedded hardware and software), to time-series database and visualization. The full system has been successfully applied in clinical trials to effectively collect pressure data in real-time.
  •  
31.
  • Ma, Ruixian, et al. (author)
  • BlockHammer : Improving Flash Reliability by Exploiting Process Variation Aware Proactive Failure Prediction
  • 2020
  • In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : Institute of Electrical and Electronics Engineers Inc.. - 0278-0070 .- 1937-4151. ; , pp. 1-1
  • Journal article (peer-reviewed), abstract:
    • NAND flash-based storage devices have gained a lot of popularity in recent years. Unfortunately, flash blocks suffer from limited endurance. To guarantee flash reliability, flash manufacturers prescribe a specified number of Program and Erase (P/E) cycles to define the endurance of flash blocks within the same chip. To extend the service lifetime of a flash-based device, existing works also assume that flash blocks have the same endurance and adopt P/E-based wear-leveling algorithms, which evenly distribute P/E cycles across flash blocks in the controller. However, many studies indicate that flash blocks exhibit a wide endurance difference due to the fabrication process. The endurance of flash blocks is limited by the weakest block. Thus, the traditional P/E-based block retirement mechanism leaves flash blocks underutilized. To fully exploit the endurance of all blocks and improve the reliability of flash devices, we present BlockHammer, a process variation aware proactive failure prediction scheme. BlockHammer takes process variation and block similarity into consideration. It consists of a block classifier and a block lifetime predictor. Using machine learning technology, we first establish a block classifier to classify flash blocks into different classes. Based on the classification results, we then establish the block lifetime prediction model for different classes. Flash blocks belonging to the same class are assigned the same model. To verify the effectiveness of BlockHammer, we collect block data from a real NAND flash-based testing platform by emulating the true application scenario of NAND flash. We compare the predicted value and the tested value, and the experimental results show that the proposed proactive failure prediction scheme can achieve more than 92% accuracy for flash blocks. Therefore, the block failure point can be accurately predicted in advance using BlockHammer, which greatly enhances the reliability of NAND flash.
  •  
32.
  • Malekzadeh, Elaheh, et al. (author)
  • The Impact of Faults on DNNs : A Case Study
  • 2021
  • In: 2021 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT). - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • Deep neural networks (DNNs) are showing superior advantages in different domains and are making their way into critical applications where reliability is the main concern. DNNs can be executed on different hardware platforms, including general-purpose processors which usually operate under floating-point (FP) numbering systems. Considering the small range of weights in DNNs stored in the FP format, some bits remain constant as 0 or 1 for all weights. On the other hand, a single event upset may flip a bit, increasing or decreasing the value of a weight. In this paper, we analyze the effect of bit flips in a sample network of LeNet5, and show the sensitivity of convolution layers to faults and the vulnerability of DNNs to a single fault in a specific bit position. At the same time, the network is inherently robust against bit flips in the other bit positions. We then show that the choice of activation functions and pooling techniques could alleviate the negative effects of faults to a large extent.
  •  
33.
  • Qin, Zidi, et al. (author)
  • A Novel Approximation Methodology and Its Efficient VLSI Implementation for the Sigmoid Function
  • 2020
  • In: IEEE Transactions on Circuits and Systems - II - Express Briefs. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-7747 .- 1558-3791. ; 67:12, pp. 3422-3426
  • Journal article (peer-reviewed), abstract:
    • In this brief, a novel approximation method and its optimized hardware implementation are proposed for the sigmoid function used in Deep Neural Networks (DNNs). Based on piecewise approximation and truncated Taylor series expansion, the proposed method achieves very good approximation with low complexity while exploiting data representation with powers of two. In addition, by analyzing gradients of the sigmoid function, a small trick is introduced to improve the approximation precision. Furthermore, to reduce the hardware complexity and shorten the critical path, sampled values of the function are generated with simple logical mapping. It is shown that the proposed approximation schemes can be implemented with purely combinational logic and the sigmoid function can be computed in one clock cycle. The experimental results demonstrate that the mean absolute errors are on the order of 1 x 10^-3. Compared with prior art, the new design can obtain significant improvement in critical path with comparable performance.
  •  
34.
  • Qin, Zidi, et al. (author)
  • A Universal Approximation Method and Optimized Hardware Architectures for Arithmetic Functions Based on Stochastic Computing
  • 2020
  • In: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 8, pp. 46229-46241
  • Journal article (peer-reviewed), abstract:
    • Stochastic computing (SC) has been applied to the implementation of complex arithmetic functions. Complicated polynomial-based approximations lead to large hardware complexity in previous SC circuits for arithmetic functions. In this paper, a novel piecewise approximation method based on Taylor series expansion is proposed for complex arithmetic functions. Efficient implementations based on unipolar stochastic logic are presented for the monotonic functions. Furthermore, detailed optimization schemes are provided for the non-monotonic functions. Using NAND and AND gates as main computing elements, the optimized hardware architectures have extremely low complexity. The experimental results show that a broad range of arithmetic functions can be implemented with the proposed SC circuits, and the mean absolute errors can reach the order of 1 x 10^-3. Compared with the state-of-the-art works, the approximation precision for some typical functions can be increased by more than 8x with our method. In addition, the proposed circuits significantly outperform the previous methods in hardware complexity and critical path.
  •  
35.
  • Sadou, Isma-Ilou, et al. (author)
  • Inference Time Reduction of Deep Neural Networks on Embedded Devices : A Case Study
  • 2022
  • In: 2022 25th Euromicro Conference on Digital System Design (DSD). - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 205-213
  • Conference paper (peer-reviewed), abstract:
    • From object detection to semantic segmentation, deep learning has achieved many groundbreaking results in recent years. However, due to the increasing complexity, the execution of neural networks on embedded platforms is greatly hindered. This has motivated the development of several neural network minimisation techniques, amongst which pruning has gained a lot of attention. In this work, we perform a case study on a series of methods with the goal of finding a small model that could run fast on embedded devices. First, we suggest a simple, but effective, ranking criterion for filter pruning called Mean Weight. Then, we combine this new criterion with a threshold-aware layer-sensitive filter pruning method, called T-sensitive pruning, to gain high accuracy. Further, the pruning algorithm follows a structured filter pruning approach that removes all selected filters and their dependencies from the DNN model, leading to fewer computations, and thus low inference time on lower-end CPUs. To validate the effectiveness of the proposed method, we perform experiments on three different datasets (with 3, 101, and 1000 classes) and two different deep neural networks (i.e., SICK-Net and MobileNet V1). We have obtained speedups of up to 13x on lower-end CPUs (Armv8) with less than 1% drop in accuracy. This satisfies the goal of transferring deep neural networks to embedded hardware while attaining a good trade-off between inference time and accuracy.
  •  
36.
  • Shen, Sirui, et al. (author)
  • A Hierarchical Parallel Discrete Gaussian Sampler for Lattice-Based Cryptography
  • 2022
  • In: 2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22). - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 1729-1733
  • Conference paper (peer-reviewed), abstract:
    • Discrete Gaussian sampling is one of the important components in lattice-based cryptosystems, which are promising candidates for post-quantum cryptographic algorithms. For sufficient security and satisfactory performance, the Knuth-Yao algorithm is an efficient way to implement discrete Gaussian samplers. Nevertheless, most polynomials in lattice-based cryptography have 256 coefficients or more, so sample generation suffers from long latency. In this paper, the first parallel discrete Gaussian sampler with hierarchical structure is proposed, while keeping statistical distance to the actual distribution. Based on the imbalanced visiting frequency of the probability matrix, a three-stage generation strategy is adopted with hierarchical bit search units (BSUs) that can greatly reduce the area consumption of the repeated costly lookup tables. Besides the architecture improvement, a lowest-set-bit scanning scheme is introduced to BSUs. Moreover, the parallelism of our design provides obfuscation ability against side-channel attacks (SCAs). A practical hardware implementation of discrete Gaussian distributions with sigma = 3.33 on the Xilinx Virtex-5 XC5VLX30 FPGA device spends 26.12 ns on average to generate 256 samples, consuming 994 slices. Results have verified its advantages in area efficiency over state-of-the-art designs (SOAs).
  •  
37.
  • Song, Wenqing, et al. (author)
  • Heterogeneous Reconfigurable Accelerator for Homomorphic Evaluation on Encrypted Data
  • 2024
  • In: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 12, pp. 11850-11864
  • Journal article (peer-reviewed), abstract:
    • Homomorphic encryption (HE) enables third-party servers to perform computations on encrypted user data while preserving privacy. Although conceptually attractive, the speed of software implementations of HE is almost impractical. To address this challenge, various domain-specific architectures have been proposed to accelerate homomorphic evaluation, but efficiency remains a bottleneck. In this paper, we propose a homomorphic evaluation accelerator with heterogeneous reconfigurable modular computing units (RCUs) for the Brakerski/Fan-Vercauteren (BFV) scheme. RCUs leverage operator abstraction to efficiently perform basic sub-operations of homomorphic evaluation such as residue number system (RNS) conversion, number theoretic transform (NTT), and other modular computations. By combining these sub-operations, complex homomorphic evaluation operations like multiplication, rotation, and addition are efficiently executed. To address the high demand for data access and improve memory efficiency, we design a coordinate-based address encoding strategy that enables in-place and conflict-free data access. Furthermore, specific optimizations are performed on the core sub-operations such as NTT and automorphism. The proposed architecture is implemented on Xilinx Virtex-7 and UltraScale+ FPGA platforms and evaluated for polynomials of length 4096. Compared to state-of-the-art accelerators with the same parameter set, our accelerator achieves the following advantages: 1) 2.04x to 3.33x reduction in the area-time product (ATP) for the key sub-operation NTT, 2) 1.08x to 7.42x reduction in latency for homomorphic multiplication with higher area efficiency, and 3) support for a wider range of homomorphic evaluation operations, including rotation, compared to other BFV-based accelerators.
  •  
38.
  • Su, Peng, et al. (author)
  • Combining Self-Organizing Map with Reinforcement Learning for Multivariate Time Series Anomaly Detection
  • 2023
  • In: Proceedings 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). - : Institute of Electrical and Electronics Engineers Inc..
  • Conference paper (peer-reviewed), abstract:
    • Anomaly detection plays a critical role in condition monitors to support the trustworthiness of Cyber-Physical Systems (CPS). Detecting multivariate anomalous data in such systems is challenging due to the lack of a complete comprehension of anomalous behaviors and features. This paper proposes a framework to address multivariate time series anomaly detection problems by combining the Self-Organizing Map (SOM) with Deep Reinforcement Learning (DRL). By clustering the multivariate data, SOM creates an environment that enables the DRL agents to interact with the collected system operational data in the form of a tabular dataset. In this environment, Markov chains reveal the likely anomalous features to support the DRL agent in exploring and exploiting the state-action space to maximize anomaly detection performance. We use a time series dataset, Skoltech Anomaly Benchmark (SKAB), to evaluate our framework. Compared with the best results of some currently applied methods, our framework improves the F1 score by 9%, from 0.67 to 0.73.
  •  
39.
  • Wang, Boqian, et al. (author)
  • Advance Virtual Channel Reservation
  • 2020
  • In: IEEE Transactions on Computers. - : Institute of Electrical and Electronics Engineers (IEEE). - 0018-9340 .- 1557-9956. ; 69:9, pp. 1320-1334
  • Journal article (peer-reviewed), abstract:
    • We present a smart communication service called Advance Virtual Channel Reservation (AVCR) to provide a highway to target packets, which can greatly reduce their contention delay in NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the network interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving the virtual channel (VC) resources ahead of the target packet transmission and offering priority service to flits in the reserved VC in the wormhole router, which can avoid the target packets' VC allocation and switch arbitration delay. Additionally, optimization schemes are proposed to increase resource utilization and system performance. We evaluate AVCR with GEM5 full-system simulations by using 24 benchmarks in PARSEC and OMP2012. Compared to the state-of-the-art mechanisms and the priority-based mechanism, experimental results show that our mechanism can significantly reduce the target packets' transfer latency and thus effectively decrease the average region-of-interest (ROI) time by 18.1 percent (maximally by 29.4 percent) across all benchmarks.
  •  
40.
  • Wang, Boqian, et al. (author)
  • Efficient Support of AXI4 Transaction Ordering Requirements in Many-Core Architecture
  • 2020
  • In: IEEE Access. - : Institute of Electrical and Electronics Engineers (IEEE). - 2169-3536. ; 8, pp. 182663-182678
  • Journal article (peer-reviewed), abstract:
    • The Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface 4 (AXI4) protocol was initially a bus oriented interface designed for on-chip communication. To offer the possibility of utilizing the AXI4 based processors and peripherals in the on-chip network based system, we propose a whole system architecture solution to make the AXI4 protocol compatible with the Network-on-Chip (NoC) based communication interconnect in the many-core architecture. Due to the out-of-order transaction in the NoC interconnect, which conflicts with the ordering requirements specified by the AXI4 protocol, we especially focus on the design of the transaction ordering units, realizing a high-performance and low cost (area) solution to the ordering requirements by the sequence ID (seq_ID) reuse mechanism and a simple but smart seq_ID synchronization process. Besides, the micro-architectures and the functionalities of the transaction ordering units are described and explained in detail for ease of implementation. The experimental results in a C++ based system simulator show that, compared with the state-of-the-art works, our solution can maximally increase the system throughput by 66.0% and decrease the transaction queueing delay in the master-side ordering unit by 91.2%.
  •  
41.
  • Wang, B., et al. (author)
  • Flexible and Efficient QoS Provisioning in AXI4-based Network-on-Chip Architecture
  • 2022
  • In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - : Institute of Electrical and Electronics Engineers (IEEE). - 0278-0070 .- 1937-4151. ; 41:5, pp. 1523-1536
  • Journal article (peer-reviewed), abstract:
    • We propose a Network-on-Chip (NoC)-based whole system design, whose communication architecture is compatible with the AMBA AXI4 protocol and supports high-performance multiple Quality-of-Service (QoS) schemes. In our system, the network interface (NI) between the NoC and the master/slave node is proposed to make the NoC architecture independent from the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. Besides, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets’ round-trip transfer in the NoC. The NoC system contains Time Division Multiplexing (TDM) and Virtual Channel (VC) subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for signal distribution between the two subnetworks. Besides, a traffic converter is proposed in each NI to balance the traffic between the two subnetworks when necessary. The experimental results show that our proposed architecture ensures a high-throughput and low-latency NoC system. By applying the traffic converter, the packet latency can be improved.
  •  
42.
  • Wang, Boqian, 1990- (author)
  • High-Performance Network-on-Chip Design for Many-Core Processors
  • 2020
  • Licentiate thesis (other academic/artistic), abstract:
    • With the development of on-chip manufacturing technologies and the requirements of high-performance computing, the core count is growing quickly in Chip Multi/Many-core Processors (CMPs) and Multiprocessor System-on-Chip (MPSoC) to support larger scale parallel execution. Network-on-Chip (NoC) has become the de facto solution for CMPs and MPSoCs in addressing the communication challenge. In the thesis, we tackle a few key problems facing high-performance NoC designs. For general-purpose CMPs, we adopt a full-system perspective to design high-performance NoC for multi-threaded programs. By exploring the cache coherence under the whole system scenario, we present a smart communication service called Advance Virtual Channel Reservation (AVCR) to provide a highway to target packets, which can greatly reduce their contention delay in NoC. AVCR takes advantage of the fact that we can know or predict the destination of some packets ahead of their arrival at the Network Interface (NI). Exploiting the time interval before a packet is ready, AVCR establishes an end-to-end highway from the source NI to the destination NI. This highway is built up by reserving the Virtual Channel (VC) resources ahead of the target packet transmission and offering priority service to flits in the reserved VC in the wormhole router, which can avoid the target packets’ VC allocation and switch arbitration delay. Besides, we also propose an admission control method in NoC with a centralized Artificial Neural Network (ANN) admission controller, which can improve system performance by predicting the most appropriate injection rate of each node using the network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node where the admission control will be applied. For application-specific MPSoCs, we focus on developing high-performance NoC and NI compatible with the common AMBA AXI4 interconnect protocol. To offer the possibility of utilizing the AXI4 based processors and peripherals in the on-chip network based system, we propose a whole system architecture solution to make the AXI4 protocol compatible with the NoC based communication interconnect in the many-core system. Due to possible out-of-order transmission in the NoC interconnect, which conflicts with the ordering requirements specified by the AXI4 protocol, we first focus on the design of the transaction ordering units, realizing a high-performance and low cost solution to the ordering requirements. The microarchitectures and the functionalities of the transaction ordering units are also described and explained in detail for ease of implementation. Then, we focus on the NI and the Quality of Service (QoS) support in NoC. In our design, the NI is proposed to make the NoC architecture independent from the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. The NoC based communication architecture is designed to support high-performance multiple QoS schemes. The NoC system contains Time Division Multiplexing (TDM) and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for traffic distribution between the two subnetworks. Besides, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets’ round-trip transfer in NoC.
  •  
43.
  • Wang, Boqian, et al. (author)
  • Supporting QoS in AXI4 based Communication Architecture
  • 2020
  • In: 2020 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2020). - : IEEE. ; , pp. 548-553
  • Conference paper (peer-reviewed), abstract:
    • In this paper, we propose a NoC-based whole-system design whose communication architecture is compatible with the AMBA AXI4 protocol and supports multiple high-performance QoS schemes. In our system, the network interface (NI) between the NoC and the master/slave node makes the NoC architecture independent of the AXI4 protocol via message format conversion between the AXI4 signal format and the packet format, offering high flexibility to the NoC design. In addition, a QoS inheritance mechanism is applied in the slave-side NI to support QoS during packets' round-trip transfer in the NoC. The NoC system contains TDM and VC subnetworks to apply multiple QoS schemes to AXI4 signals with different QoS tags, and the NI is responsible for traffic distribution between the two subnetworks. The experimental results show that our proposed system architecture can achieve good performance and satisfy different QoS needs.
  •  
44.
  • Wang, J., et al. (authors)
  • Optimal Sprinting Pattern in Thermal Constrained CMPs
  • 2021
  • In: IEEE Transactions on Emerging Topics in Computing. - : IEEE Computer Society. - 2168-6750.
  • Journal article (peer-reviewed). Abstract:
    • CS (Computational Sprinting) is a promising technique to tackle the thermal challenge in CMPs (Chip Multi-Processors). The sprinting pattern, i.e., the boosted chip frequency and voltage during the sprinting time, greatly impacts CMP performance. In this paper, we address how to find the optimal sprinting pattern, which maximizes the performance of CMPs within the thermal limit. First, we conduct a mathematical proof to show that any thermally constrained CMP, when it executes an application, has a specialized, sustainable configuration (v_o; f_o), under which the CMP can keep sprinting without resting while its performance is maximized. Then, we design a self-adaptive algorithm that automatically alters the chip frequency, with adjustable step size, and the voltage at runtime to reach the optimal value. Finally, our extensive experimental results reveal that our Optimal Sprinting Pattern (OSP) outperforms state-of-the-art sprinting techniques, Full Sprinting Policy (FSP) and Adaptive Sprinting Pacing (ASP). Specifically, our OSP improves the computational efficiency in MIPS by up to 59% against FSP and 40% against ASP. It also achieves higher energy efficiency in MIPJ, by up to 41% and 25% over FSP and ASP, respectively. Moreover, we demonstrate that our method is effective for various CMPs with different scales, CPU architectures and chip nano-technologies.
  •  
45.
  • Wang, Yu, et al. (authors)
  • FlexZNS : Building High-Performance ZNS SSDs with Size-Flexible and Parity-Protected Zones
  • 2023
  • In: Proceedings - 2023 IEEE 41st International Conference on Computer Design, ICCD 2023. - : Institute of Electrical and Electronics Engineers (IEEE). ; , pp. 291-299
  • Conference paper (peer-reviewed). Abstract:
    • NVMe zoned namespace (ZNS) SSDs present a new class of storage devices with attractive features including low cost, software definability, and stable performance. However, one primary culprit that hinders the adoption of ZNS is the high garbage collection (GC) overhead it brings to host software. The ZNS interface divides the logical address space into size-fixed zones that must be written sequentially. Despite being friendly to flash memory, ZNS requires host software to perform out-of-place updates and GC on individual zones. Current ZNS SSDs typically employ a large zone size (e.g., of GBs) to be conducive to die-level RAID protection on flash memory. This impedes flexible data placement, such as mixing data with different lifetimes in the same zone, and incurs sizable data migrations during zone GC. To address this problem, we propose FlexZNS, a novel ZNS SSD design that provides reliable zoned storage allowing host software to configure the zone size flexibly as well as multiple zone sizes. The size variability of zones poses two interrelated challenges, one for the SSD controller to establish per-zone RAID protection, and the other for host software to manage variable zone capacity loss caused by parity storage. To tackle the challenges, FlexZNS decouples the storage of parity from individual zones on flash memory and hides the zone capacity loss from the host software. We verify FlexZNS on a ZNS-compatible file system F2FS and a popular key-value store RocksDB. Extensive experiments demonstrate that FlexZNS can significantly improve the system performance and reduce GC-induced write amplification, compared with a conventional ZNS SSD with large-sized zones.
  •  
46.
  • Wang, Yu, et al. (authors)
  • Holistic and Opportunistic Scheduling of Background I/Os in Flash-Based SSDs
  • 2023
  • In: IEEE Transactions on Computers. - : Institute of Electrical and Electronics Engineers (IEEE). - 0018-9340 .- 1557-9956. ; 72:11, pp. 3127-3139
  • Journal article (peer-reviewed). Abstract:
    • Background (BG) tasks are indispensable in multiple layers of storage systems, from applications to flash-based SSDs. They issue a large number of I/Os, causing significant interference with foreground (FG) I/O performance. Our key insight is that, to mitigate such interference, holistic scheduling of system-wide, multi-source BG I/Os is required and can only be realized at the underlying SSD layer. Only the SSD has a global view of all FG and BG I/Os as well as direct information about and control over flash storage resources. We are thus inspired to propose a novel I/O scheduling architecture, called HuFu. It provides a framework for host software to register BG tasks and offload their I/O scheduling into the SSD. Then, the SSD-internal I/O scheduler prioritizes FG I/O processing, while BG I/Os are scheduled opportunistically by utilizing flash parallelism and idleness. To verify HuFu, we perform case studies on RocksDB and compare it with several state-of-the-art host-side I/O scheduling schemes. Experimental results show that HuFu can significantly alleviate performance interference caused by BG I/Os and improve SSD bandwidth utilization, thus improving FG throughput and average and tail latencies (e.g., by about 18% in a write-heavy workload).
  •  
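The scheduling policy summarized in the HuFu abstract (foreground I/Os first, background I/Os only when the flash resource would otherwise sit idle) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the paper's scheduler: the class name `ChipScheduler`, the one-queue-pair-per-flash-chip model, and the string request labels are all hypothetical and chosen only for illustration.

```python
from collections import deque


class ChipScheduler:
    """Toy per-chip scheduler (hypothetical): FG requests always win,
    BG requests are served only when no FG request is waiting."""

    def __init__(self):
        self.fg = deque()  # foreground requests, FIFO order
        self.bg = deque()  # background requests, FIFO order

    def submit(self, request, background=False):
        # Route the request to the queue matching its class.
        (self.bg if background else self.fg).append(request)

    def next_request(self):
        # Foreground work is prioritized.
        if self.fg:
            return self.fg.popleft()
        # The chip would otherwise be idle, so run background work.
        if self.bg:
            return self.bg.popleft()
        return None


sched = ChipScheduler()
sched.submit("compaction-write", background=True)
sched.submit("user-read")
order = [sched.next_request(), sched.next_request(), sched.next_request()]
```

In this sketch the background request was submitted first but is served second, because the foreground read preempts it in the queue order; a real design would also need to account for parallelism across chips and for in-flight operations.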
47.
  • Yang, Yu (author)
  • High-Level Synthesis for SiLago : Advances in Optimization of High-Level Synthesis Tool and Neural Network Algorithms
  • 2022
  • Doctoral thesis (other academic/artistic). Abstract:
    • Embedded hardware designs and their automation improve energy and engineering efficiency. However, these two goals are often contradictory. Attempts to improve energy efficiency often come at the cost of engineering efficiency and vice versa. High-level synthesis (HLS) is a good example of this challenge. It has been researched for more than three decades. Nevertheless, it has not become a mainstream design flow component for custom hardware synthesis due to the large efficiency gap between the HLS-generated hardware design and the manual RTL design.
      This thesis attempts to address the HLS challenge. We divide the research challenge of improving state-of-the-art HLS into three components: 1) the hardware architecture and its underlying VLSI design style, 2) the design automation algorithms and data structures, and 3) the optimization of the algorithm to be mapped.
      The SiLago hardware platform has been reported as a prominent hardware architecture that can deliver ASIC-like efficiency and could be an ideal HLS hardware platform. It has the following features: 1) SiLago embodies parallel distributed two-level control. 2) SiLago blocks are hardened blocks that can create valid VLSI designs by abutment without involving logic or physical synthesis.
      Consequently, when targeting the SiLago hardware platform, the SiLago HLS tool generates not a single controller but multiple collaborative controllers, each of which is a hierarchy of two levels. The distributed two-level control scheme poses unique challenges in synchronization and scheduling tasks. Unique data structures and instruction scheduling models are developed for the SiLago HLS tool to support the distributed two-level control scheme. The SiLago HLS tool also generates a valid GDSII macro whose average energy, area, and performance are not estimated but known with post-layout accuracy thanks to the predictable SiLago hardware blocks. Moreover, the SiLago HLS tool is not intended for the end-user. It is designed to develop a library of algorithm implementations used by the application-level synthesis (ALS) tool in the SiLago framework. The application is defined as a hierarchy of algorithms. This library would include algorithms that vary in their function, dimension, and degree of parallelism. The ALS tool explores the design space in terms of the number and type of algorithm implementations, rather than arithmetic resources, as HLS tools do.
      Algorithms are often developed by domain experts. For efficient implementation in hardware, such algorithms often need to be optimized with the hardware platform in mind. Two algorithm instances have been chosen for demonstration purposes. The first instance is a self-organizing map (SOM) based genome recognition algorithm. The second instance concerns a complex model of the cortex called the Bayesian confidence propagation neural network (BCPNN). As developed by computational neuroscientists, the original model demands too much memory storage and memory access.
      This thesis addresses the latter two components because the first component has been addressed in the literature. We first demonstrate the design of the SiLago HLS tool to support hardware features such as the distributed two-level control system. Moreover, we use the two complex algorithm instances, SOM and BCPNN, to demonstrate both general-purpose and algorithm-specific hardware-oriented algorithm optimization techniques. With the research carried out in this thesis, the SiLago HLS framework is greatly improved.
  •  
48.
  • Yao, Yuan, 1986-, et al. (authors)
  • Pursuing Extreme Power Efficiency With PPCC Guided NoC DVFS
  • 2020
  • In: IEEE Transactions on Computers. - : IEEE COMPUTER SOC. - 0018-9340 .- 1557-9956. ; 69:3, pp. 410-426
  • Journal article (peer-reviewed). Abstract:
    • In sharp contrast to conventional performance-indicative-based Network-on-Chip (NoC) DVFS, where the direct relation between application performance and NoC power consumption is missing, we exploit the concept of the Performance-Power Characteristic Curve (PPCC), newly proposed in the literature, to approach maximum NoC power efficiency. PPCC, which defines the direct relation between application performance and NoC power consumption, consists of three distinct regions: an inertial region due to power under-provisioning, a linear region for proportional performance gain, and a saturation region due to power over-provisioning. With PPCC as guidance, we propose Delta-DVFS, which employs a "profile-then-select" strategy to approach maximum NoC power efficiency step by step. Delta-DVFS is built on two observations. First, in multi-threaded applications, maximum NoC power efficiency is achieved at the boundary between the linear region and the saturation region on the PPCC. Second, PPCC stabilizes when threads repeat workloads of the same loop. This is intuitively meaningful because loop repetition stresses the NoC with a similar workload. Based on these observations, Delta-DVFS uses the first several loop iterations for PPCC profiling. After the profiling is done, Delta-DVFS selects the optimal V/F that achieves maximum NoC power efficiency and applies it to the remaining loop iterations. To follow PPCC accurately and in a timely manner when threads proceed to different loops, Delta-DVFS utilizes an H-tree loop monitor to detect loop changes among distributed threads.
  •  
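The "profile-then-select" idea in the Delta-DVFS abstract can be sketched as follows: profile a few V/F levels, then pick the point where the curve leaves the linear region and enters saturation. The abstract does not give the actual selection rule, so the rule used here (take the lowest-power level whose performance is within a tolerance of the best observed performance) is an assumption, as are the function name `select_vf_level`, the tuple layout, and the sample numbers.

```python
def select_vf_level(profile, saturation_tol=0.02):
    """Pick a V/F level near the linear/saturation boundary of a profiled curve.

    profile: list of (vf_level, power, performance) tuples sorted by
             ascending power (hypothetical layout).
    saturation_tol: assumed tolerance; a level counts as saturated when its
             performance is within this fraction of the best observed value.
    """
    peak = max(perf for _, _, perf in profile)
    for level, _power, perf in profile:
        # The first (lowest-power) level that already reaches near-peak
        # performance approximates the start of the saturation region.
        if perf >= (1.0 - saturation_tol) * peak:
            return level
    return profile[-1][0]


# Made-up profile with inertial (levels 0-1), linear (2-3) and
# saturation (4-5) behavior; values are illustrative only.
sample_profile = [
    (0, 1.0, 10.0),
    (1, 2.0, 10.5),
    (2, 3.0, 50.0),
    (3, 4.0, 90.0),
    (4, 5.0, 99.0),
    (5, 6.0, 100.0),
]
chosen = select_vf_level(sample_profile)
```

With this made-up profile the sketch selects level 4: spending more power at level 5 buys only about 1% more performance, which is the over-provisioning the abstract describes.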
49.
  • Yu, Yang, 1991- (author)
  • Design and Security Analysis of TRNGs and PUFs
  • 2022
  • Doctoral thesis (other academic/artistic). Abstract:
    • True Random Number Generators (TRNGs) and Physical Unclonable Functions (PUFs) are two important types of cryptographic primitives. TRNGs create hardware-based, non-deterministic noise that is often used for generating keys, initialization vectors, and nonces for various applications that require cryptographic protection. PUFs have been proposed as a tamper-resistant alternative to traditional secret key generation and challenge-response authentication methods. A compromised TRNG or PUF can lead to a system-wide loss of security.
      Conventional TRNG and PUF designs are challenged by new attack vectors such as deep learning-based side-channel analysis. In this dissertation, we propose several new PUF and TRNG designs and evaluate their performance and security.
      The first PUF we introduce is called the threshold PUF. We show that, in principle, any n-input threshold logic gate can be used as a base for building an n-input PUF. As a proof of concept, we implement and evaluate a threshold PUF based on recently proposed threshold logic flip-flops using SPICE simulation. Threshold PUFs open up the possibility of using the rich body of knowledge on threshold logic implementations for designing PUFs. The second proposed design is a lightweight PUF construction called CRC-PUF, which focuses on protecting PUFs against machine learning-based modeling attacks. In CRC-PUF, input challenges are de-synchronized from output responses to make the PUF model difficult to learn. The input transformation that performs the de-synchronization is based on a Cyclic Redundancy Check (CRC), hence the name CRC-PUF. By changing the CRC generator polynomial for each new response, we ensure that recovering the transformed challenge has a success probability of at most 2^-86 for 128-bit challenge-response pairs.
      The first TRNG design we introduce is based on a Non-Linear Feedback Ring Oscillator (NLFRO). The proposed NLFRO-TRNG structure harvests randomness from noise and unpredictable variations in delay cells and bi-stable elements, which is further amplified by the formation of non-linear feedback loops. The NLFRO outputs have chaotic behavior, allowing the construction of TRNGs with high entropy and speed. We implement three NLFRO-TRNGs on FPGA and evaluate the properties of the implementations with the NIST 800-90B entropy estimation and NIST 800-22 statistical test suites. The second proposed TRNG design is based on a strong PUF. The PUF-based TRNG exploits the inherent determinism of the PUF to enable in-field testing of the entropy sources by known-answer tests. We present a prototype FPGA implementation of the proposed TRNG based on an arbiter PUF that passes all NIST 800-22 statistical tests and has a minimum entropy of 0.918, estimated according to NIST 800-90B recommendations.
      Apart from TRNG and PUF designs, it is crucial to consider potential attack vectors that can be created by leveraging recently emerged technologies. To that end, in the second part of this dissertation, we introduce a novel attack on FPGA-based PUF and TRNG implementations that combines bitstream modification with deep learning-based side-channel analysis. We evaluate this new attack vector on the design of an arbiter PUF and a ring oscillator-based TRNG implemented on Xilinx Artix-7 28nm FPGAs. In both cases, we are able to achieve close to 100% classification accuracy in recovering the output or response. In the case of the arbiter PUF, the attack can even overcome countermeasures that are based on encrypting the challenges or responses.
      With such potent attack vectors readily available, the construction of strong countermeasures is necessary. Unfortunately, many of the state-of-the-art countermeasures are one-sided. In the final part of the dissertation, we use a countermeasure proposed for the protection of the Advanced Encryption Standard as an example. We conduct experiments and conclude that it can assist another type of side-channel attack that is not considered by the countermeasure.
  •  
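The CRC-PUF part of the abstract above rests on one operation: a CRC over the challenge bits, with the generator polynomial changed for each new response. The sketch below shows only that building block, polynomial division over GF(2). How the thesis wires the CRC output into the PUF is not stated in the abstract, so treating the remainder as "the transformed challenge" is an assumption, and the name `crc_remainder` and the tiny example polynomials are hypothetical.

```python
def crc_remainder(challenge: int, challenge_bits: int, poly: int) -> int:
    """Remainder of challenge(x) * x^deg divided by poly(x) over GF(2).

    challenge:      challenge bits packed into an integer (MSB first)
    challenge_bits: number of bits in the challenge
    poly:           generator polynomial including its leading 1 bit,
                    e.g. 0b1011 for x^3 + x + 1
    """
    deg = poly.bit_length() - 1
    reg = challenge << deg  # append deg zero bits
    for shift in range(challenge_bits - 1, -1, -1):
        # If the current leading bit is set, subtract (XOR) the aligned
        # generator polynomial, exactly as in binary long division.
        if reg & (1 << (shift + deg)):
            reg ^= poly << shift
    return reg  # only the low deg bits can remain set


# Same challenge, two different generator polynomials (illustrative values).
challenge = 0b11010011101100
r_a = crc_remainder(challenge, 14, 0b1011)
r_b = crc_remainder(challenge, 14, 0b1101)
```

Changing `poly` changes the mapping from challenge to remainder, which is the property the abstract relies on; real designs would use much longer challenges and polynomials than this 14-bit example.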
50.
  • Zhang, Y., et al. (authors)
  • Base-2 Softmax Function : Suitability for Training and Efficient Hardware Implementation
  • 2022
  • In: IEEE Transactions on Circuits and Systems Part 1. - : Institute of Electrical and Electronics Engineers (IEEE). - 1549-8328 .- 1558-0806. ; 69:9, pp. 3605-3618
  • Journal article (peer-reviewed). Abstract:
    • The softmax function is widely used in deep neural networks (DNNs), and its hardware performance plays an important role in the training and inference of DNN accelerators. However, due to the complexity of the traditional softmax, the existing hardware architectures are resource-consuming or have low precision. To address these challenges, we study a base-2 softmax function in terms of its suitability for neural network training and efficient hardware implementation. Compared to the classical base-e softmax function, the base-2 softmax function is a new softmax function that uses 2 as the exponential base instead of e. Through mathematical derivation and software simulation, we first demonstrate the feasibility and good accuracy of the base-2 softmax function in neural network training. Then, we use the symmetric-mapping lookup table (SM-LUT) method to design a low-complexity yet high-precision architecture to implement it. Under TSMC 28nm CMOS technology, an example design of our architecture has an area of 5676 μm² and a power consumption of 13.12 mW for circuit synthesis at a frequency of 3 GHz. Compared with the latest works, our architecture achieves the best performance and efficiency.
  •  
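The base-2 softmax in the abstract above differs from the classical function only in its exponential base, which can be made concrete with a few lines of software. This is a minimal sketch under stated assumptions: the function names and the max-subtraction step (a common numerical-stability trick) are choices made here, and the paper's SM-LUT hardware architecture is not reproduced.

```python
import math


def softmax_base2(logits):
    """Base-2 softmax: 2**x_i / sum_j 2**x_j."""
    m = max(logits)
    # Subtracting the maximum does not change the result, but keeps every
    # power in (0, 1] so large logits cannot overflow.
    powers = [2.0 ** (x - m) for x in logits]
    total = sum(powers)
    return [p / total for p in powers]


def softmax_base_e(logits):
    """Classical softmax with base e, written the same way for comparison."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]


probs = softmax_base2([0.0, 1.0])  # 2**0 : 2**1 is a 1 : 2 ratio
```

Because 2**x equals e**(x ln 2), the base-2 function applied to logits x gives the same result as the base-e function applied to x scaled by ln 2. In other words, it acts as a classical softmax with a fixed temperature, which is one way to see why it can remain usable for training.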