SwePub

Result list for search "(db:Swepub) pers:(Lu Zhonghai) conttype:(scientificother) srt2:(2015-2019)"

  • Results 1-9 of 9
1.
  • Badawi, Mohammad, 1981- (author)
  • Adaptive Coarse-grain Reconfigurable Protocol Processing Architecture
  • 2016
  • Doctoral thesis (other academic/artistic). Abstract:
    • Digital signal processors and their variants have provided significant benefits for the efficient implementation of the Physical Layer (PHY) of the Open Systems Interconnection (OSI) model’s seven-layer protocol processing stack compared to general-purpose processors. Protocol processors promise to provide a similar advantage for implementing the higher layers of the OSI seven-layer model. This thesis addresses the problem of designing customizable coarse-grain reconfigurable protocol processing fabrics as a solution to achieving high performance and computational efficiency. A key requirement that this thesis addresses is the ability to adapt not only to varying applications and standards, and to different modes in each standard, but also to time-varying load and performance demands while maintaining quality of service.
      This thesis presents a tile-based multicore protocol processing architecture that can be customized at design time to meet the requirements of the target application. The architecture can then be reconfigured at boot time and tuned to suit the desired use-case. This architecture includes a packet-oriented memory system that has deterministic access time and access energy costs, and hence can be accurately dimensioned to fulfill the requirements of the desired use-case. Moreover, to maintain quality of service as predicted while minimizing the use of energy and resources, this architecture encompasses an elastic management scheme that controls run-time configuration to deploy processing resources based on use-case and traffic demands.
      To evaluate the architecture presented in this thesis, different case studies were conducted, and quantitative and qualitative metrics were used for assessment. Energy-delay product, energy efficiency, area efficiency and throughput show the improvements that were achieved using the processing cores and the memory of the presented architecture, compared with other solutions. Furthermore, the results show the reduction in latency and power consumption required to evaluate controlling states when using the elastic management scheme. The elasticity of the scheme also reduced the total area required for the controllers that serve multiple processing cores in comparison with other designs. Finally, the results validate the ability of the presented architecture to support quality of service without misusing available energy during a real-life case study of a multi-participant Voice over Internet Protocol (VoIP) call.
2.
  • Chen, Xiaowen, 1982- (author)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doctoral thesis (other academic/artistic). Abstract:
    • In NoC-based many-core processors, the memory subsystem and the synchronization mechanism are two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also an efficient synchronization mechanism. Therefore, we are motivated to research efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.
      One major way of optimizing memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable for NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy programming. Therefore, we first adopt the microcoded approach to address DSM issues, aiming for hardware performance while maintaining the flexibility of programs. Second, we further optimize DSM performance by reducing the virtual-to-physical address translation overhead. In addition to general-purpose memory organizations such as DSM, there exist special-purpose memory organizations that optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation.
      In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies for some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors by narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency.
      Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause a large performance penalty. Motivated by this, and in contrast to the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. In addition, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
3.
  •  
4.
  •  
5.
  • Ma, Ning, et al. (author)
  • Design and Implementation of Multi-mode Routers for Large-scale Inter-core Networks
  • 2016
  • In: Integration. Elsevier. ISSN 0167-9260, 1872-7522. Vol. 53, pp. 1-13
  • Journal article (other academic/artistic). Abstract:
    • Constructing on-chip or inter-silicon (inter-die/inter-chip) networks to connect multiple processors extends system capability and scalability. A key issue is implementing a flexible router that can fit into various application scenarios. This paper proposes a multi-mode adaptable router that supports both circuit and wormhole switching while supplying flexible working strategies for specific traffic patterns in diverse applications. The limitation of mono-mode switched routers is shown first, followed by algorithm exploration in the proposed router for choosing the proper working strategy in a specific network. We then present the performance improvement when applying the mixed circuit/wormhole switching mode to different applications, and analyze image decoding as a case study. The multi-mode router has been implemented with different configurations in a 65 nm CMOS technology. The one with 8-bit flit width is demonstrated together with a multi-core processor to show the feasibility. Working at 350 MHz, the average power consumption of the whole system is 22 mW.
6.
  • Ma, Ning (author)
  • Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing
  • 2015
  • Doctoral thesis (other academic/artistic). Abstract:
    • The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling caused by the difficulty of further lowering the supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency of hardware resources must be improved to maximize performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.
      To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for the Internet of Things (IoT), are implemented in the proposed ASIPs.
      Multimedia stream processing for human-computer interaction always features high data throughput. To design processors for networked multimedia streams, application-specific accelerators controlled by the embedded processor are customized. By abstracting the common features from multiple coding algorithms, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 µm CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW.
      When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors are connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of the communication network, as well as the task assignment schemes.
      In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core architecture uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. The idea of reducing global data transfer also increases the energy efficiency. Finally, a single tile processor together with a network of bare dies and a network of packaged chips has been demonstrated as the result. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.
7.
  • Shaoteng, Liu, 1984- (author)
  • New circuit switching techniques in on-chip networks
  • 2015
  • Doctoral thesis (other academic/artistic). Abstract:
    • Network on Chip (NoC) is proposed as a promising technology to address the communication challenges in the deep sub-micron era. NoC brings network-based communication into the on-chip environment and tackles problems such as long wire complexities, bandwidth scaling and so on. After more than a decade's evolution and development, there are many NoC architectures and solutions available. Nevertheless, NoCs can be classified into two categories: packet switched NoC and circuit switched NoC. In this thesis, targeting circuit switched NoC, we present our innovations and considerations on circuit switched NoCs in three areas, namely, connection setup method, time division multiplexing (TDM) technology and spatial division multiplexing (SDM) technology.
      The connection setup technique deeply influences the architecture and performance of a circuit switched NoC, since a circuit switched NoC must set up connections before launching data transfer. We propose a novel parallel probe based method for dynamic distributed connection setup. On the one hand, this setup method searches all the possible minimal paths in parallel. On the other hand, it also has a mechanism to reduce resource occupation during the path search process by reclaiming redundant paths. With this setup method, connections are more likely to be established because path diversity is explored.
      TDM based NoC constitutes a sub-category of circuit switched NoC. We propose a double time-wheel technique to facilitate a probe based connection setup in TDM NoCs. With this technique, path search algorithms used in connection setup are no longer limited to deterministic routing algorithms. Moreover, the hardware cost can be reduced, since setup requests and data flows can co-exist in one network. Apart from the double time-wheel technique for connection setup, we also propose a highway technique that can enhance the slot utilization during data transfer. This technique can accelerate the transfer of a data flow while maintaining the throughput guarantee and the packet order.
      SDM based NoC constitutes another sub-category of circuit switched NoC. SDM NoC can benefit from high clock frequency and simple synchronization efforts. To better support dynamic connection setup in SDM NoCs, we design a single cycle allocator for channel allocation inside each router. This allocator can guarantee both strong fairness and maximal matching quality. We also build a circuit switched NoC, which can support multiple channels and multiple networks, to study different ways of organizing channels and setting up connections. Finally, we make a comparison between circuit switched NoC and packet switched NoC. We show the strengths and weaknesses of each by analysis and evaluation.
8.
  • Yao, Yuan, 1986- (author)
  • Power and Performance Optimization for Network-on-Chip based Many-Core Processors
  • 2019
  • Doctoral thesis (other academic/artistic). Abstract:
    • Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As the core count scales up and the transistor size shrinks, optimizing power and performance for NoCs opens new research challenges.
      As it can potentially consume 20-40% of the entire chip power, NoC power efficiency has emerged as one of the main design constraints in today's and future high performance CMPs. For NoC power management, we propose a novel on-chip DVFS technique that is able to adjust per-region NoC V/F according to voted V/F levels from communicating threads. A thread periodically votes for a preferred NoC V/F level that best suits its individual performance interests. The final DVFS decision of each region is adjusted by a region DVFS controller democratically based on the majority of votes it receives.
      Mutually exclusive locks are pervasive shared memory synchronization primitives. In advanced locks such as the Linux queue spinlock, comprising a low-overhead spinning phase and a high-overhead sleeping phase, we show that the lock primitive may create very high competition overhead (COH), which is the time threads spend competing with each other for the next critical section grant. For performance enhancement, we propose a software-hardware cooperative mechanism that can opportunistically maximize the chance of a thread winning the critical section in the low-overhead spinning phase and minimize the chance of winning the critical section in the high-overhead sleeping phase, so that COH is significantly reduced. In addition, we further observe that the cache invalidation-acknowledgement round-trip delay between the home node storing the critical section lock and the cores running competing locks can heavily degrade application performance. To reduce such high lock coherence overhead (LCO), we propose in-network packet generation (iNPG) to turn passive "normal" NoC routers into active "big" ones that can not only transmit but also generate packets to perform early invalidation and collect inv-acks. iNPG effectively shortens the protocol round-trip delay and thus largely reduces LCO in various locking primitives.
      To enhance performance fairness when running multiple multi-threaded programs on a single CMP, we develop the concept of aggregate flow, which refers to a sequence of associated data and cache coherence flows issued from the same thread. Based on the aggregate flow concept, we propose three coherent mechanisms to efficiently achieve performance isolation: rate profiling, rate inheritance and flow arbitration. Rate profiling dynamically characterizes thread performance and communication needs. Rate inheritance allows a data or coherence reply flow to inherit the characteristics of its associated data or coherence request flow, so that a consistent bandwidth allocation policy is applied to all sub-flows of the same aggregate flow. Flow arbitration uses a proven scheduling policy, self-clocked fair queueing (SCFQ), to achieve rate-proportional arbitration for different aggregate flows. Our approach successfully achieves balanced performance isolation with different mixtures of applications.
9.
  • Zhao, Xueqian, 1986- (author)
  • Network on Chip : Performance Bound and Tightness
  • 2015
  • Doctoral thesis (other academic/artistic). Abstract:
    • Featuring good scalability, modularity and large bandwidth, Network-on-Chip (NoC) has been widely applied in manycore Chip Multiprocessor (CMP) and Multiprocessor System-on-Chip (MPSoC) architectures. The provision of guaranteed service emerges as an important NoC design problem owing to application requirements for Quality-of-Service (QoS).
      Formal analysis of performance bounds plays a critical role in ensuring guaranteed service of NoC by giving insights into how the design parameters impact the network performance. The study in this thesis proposes analysis methods for delay and backlog bounds with Network Calculus (NC). Based on xMAS (eXecutable Micro-Architectural Specification), a formal framework to model communication fabrics, the delay bound analysis procedure is presented using NC. The micro-architectural xMAS representation of a canonical on-chip router is proposed with both the data flow and control flow well captured. Furthermore, a well-defined xMAS model for a specific application on an NoC can be created with network and flow knowledge and then be mapped to a corresponding NC analysis model for end-to-end delay bound calculation. The xMAS model effectively bridges the gap between the informal NoC micro-architecture and the formal analysis model. Besides the delay bound, the analysis of the backlog bound is also crucial for predicting the buffer dimensioning boundary in on-chip Virtual Channel (VC) routers. In this thesis, basic buffer use cases are identified with corresponding analysis models proposed so as to decompose the complex flow contention in a network. Then we develop a topology independent analysis technique to carry out the backlog bound analysis step by step. Algorithms are developed to automate this analysis procedure.
      Accompanying the analysis of performance bounds, tightness evaluation is an essential step to ensure the validity of the analysis models. However, this evaluation process is often a tedious, time-consuming, and manual simulation process in which many simulation parameters may have to be configured before the simulations run. In this thesis, we develop a heuristics aided tightness evaluation method for the analytical delay and backlog bounds. The tightness evaluation is abstracted as constrained optimization problems with the objectives formulated as implicit functions with respect to the system parameters. Based on the well-defined problems, heuristics can be applied to guide a fully automated configuration searching process which incorporates cycle-accurate bit-accurate simulations. As an example of heuristics, Adaptive Simulated Annealing (ASA) is adopted to guide the search in the configuration space. Experiment results indicate that the performance analysis models based on NC give tight results, which are effectively found by the heuristics aided evaluation process even when the model has a multidimensional discrete search space and complex constraints.
      In order to facilitate xMAS modeling and corresponding validation of the performance analysis models, the thesis presents an xMAS tool developed in Simulink. It provides a friendly graphical interface for xMAS modeling and parameter configuring based on the powerful Simulink modeling environment. Hierarchical model build-up and Verilog-HDL code generation are essentially supported to manage complex models and conduct simulations. Owing to its synthesizable xMAS library and good extensibility, this xMAS tool has promising use in application specific NoC design based on the xMAS components.
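Result 9 (Zhao) analyzes delay and backlog bounds with Network Calculus. As a reminder of what such bounds look like, the standard single-node case can be written out as below. This is a textbook sketch, assuming a token-bucket arrival curve with burst \sigma and rate \rho, and a rate-latency service curve with rate R and latency T; the abstract does not state which curves or symbols the thesis uses.

```latex
\begin{align}
  \alpha(t) &= \sigma + \rho t, &
  \beta(t)  &= R \,\max(0,\; t - T), &
  \rho &\le R, \\
  D_{\max} &\le T + \frac{\sigma}{R}, &
  B_{\max} &\le \sigma + \rho T .
\end{align}
```

Here D_max is the horizontal deviation between the two curves (worst-case delay) and B_max is the vertical deviation (worst-case backlog, which bounds the buffer size needed).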
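Result 8 (Yao) describes a region DVFS controller that sets a per-region NoC V/F level from the votes of communicating threads. The voting step might be sketched as follows. This is a minimal illustration, not the thesis's implementation, and it rests on assumptions the abstract does not state: levels are integers, the most-voted level wins (plurality rather than strict majority), ties go to the higher level, and an empty ballot keeps the current level. The name `decide_region_vf` is hypothetical.

```python
from collections import Counter


def decide_region_vf(votes, current_level):
    """Pick a V/F level for one NoC region from per-thread votes.

    votes: iterable of integer V/F levels, one vote per thread (assumed encoding).
    current_level: level kept when no thread voted in this period.
    """
    votes = list(votes)
    if not votes:
        # Assumption: with no votes, the region keeps its current level.
        return current_level

    counts = Counter(votes)
    top = max(counts.values())
    # All levels that received the highest number of votes.
    winners = [level for level, n in counts.items() if n == top]
    # Assumption: break ties toward the higher level to favor performance.
    return max(winners)
```

A controller would call this once per voting period for each region, for example `decide_region_vf([2, 2, 1], current_level=0)`.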