SwePub
Sök i SwePub databas

  Utökad sökning

Träfflista för sökning "WFRF:(Lu Zhonghai) srt2:(2010-2014)"

Sökning: WFRF:(Lu Zhonghai) > (2010-2014)

  • Resultat 1-50 av 93
Sortera/gruppera träfflistan
   
NumreringReferensOmslagsbildHitta
1.
  • Anagnostopoulos, Iraklis, et al. (författare)
  • Custom Microcoded Dynamic Memory Management for Distributed On-Chip Memory Organizations
  • 2011
  • Ingår i: IEEE Embedded Systems Letters. - 1943-0663. ; 3:2, s. 66-69
  • Tidskriftsartikel (refereegranskat)abstract
    • Multiprocessor system-on-chip (MPSoCs) have attracted significant attention since they are recognized as a scalable paradigm to interconnect and organize a high number of cores. Current multicore embedded systems exhibit increased levels of dynamicbehavior, leading to unexpected memory footprint variations unknown at design time.Dynamic memory management (DMM) is a promising solution for such types of dynamicsystems. Although some efficient dynamic memory managers have been proposed for conventional bus-based MPSoC platforms, there are no DMM solutions regarding the constraints and the opportunities delivered by the physical distribution of multiple memorynodes of the platform. In this work, we address the problem of providing customizedmicrocoded DMM on MPSoC platforms with distributed memory organization. Customization is enabled at application-and platform-level. Results show that customizedmicrocoded DMM can serve approximately 7× more allocation requests compared to puredistributed memory platforms and perform 25% faster than the corresponding high-level implementation in C language. 
  •  
2.
  • Badawi, Mohammad, et al. (författare)
  • Customizable Coarse-grained Energy-efficient Reconfigurable Packet Processing Architecture
  • 2014
  • Ingår i: Proceedings Of The 2014 IEEE 25th International Conference on Application-specific Systems, Architectures and Processors (ASAP). - : IEEE. ; , s. 30-35
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we present a highly customizable and rapidly reconfigurable multi-core packet processing architecture that provides energy and area efficiency while retaining flexibility. Presented architecture with its agile reconfigurability permits time-critical adaptability where resources can be re-clustered at run time in few cycles, hence, maintaining efficiency if requirements of the use-case change. We elaborate the flexibility and adaptability of our architecture and we report its evaluation results. For evaluation, we performed the widely-used UDP/IP and we compared our proposed architecture to low-power 32-bit general purpose processors, a custom ASIC implementation and a programmable protocol processor. Compared to GPP-based solutions, our architecture is 20-34 times more energy efficient while providing 2.4-4.1 times higher throughput. While retaining the programmability, the proposed solution achieved 78% of the energy efficiency of hardwired ASIC implementation. Compared to a programmable protocol processor, our solution has 2.6 times more throughput and requires only a third of the gate count. lastly, we quantified the worst-case time and average-case time required for time-critical adaptability when reconfiguration occurs during a real-life Voice-Over IP traffic.
  •  
3.
  • Candaele, Bernard, et al. (författare)
  • Mapping Optimisation for Scalable multi-core ARchiTecture : The MOSART approach
  • 2010
  • Ingår i: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010. - 9780769540764 ; , s. 518-523
  • Konferensbidrag (refereegranskat)abstract
    • The project will address two main challenges of prevailing architectures: 1) The global Interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; 2) The difficulties in programming heterogeneous, multi-core platforms, in particular in dynamically managing data structures in distributed memory. MOSART aims to overcome these through a multi-core architecture with distributed memory organisation, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimised and customised together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: A) Providing platform support for management of abstract data structures Including middleware services and a run-time data manager for NoC based communication infrastructure; 2) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.
  •  
4.
  • Candaele, Bernard, et al. (författare)
  • The MOSART Mapping Optimization for multi-core Architectures
  • 2011
  • Ingår i: VLSI 2010 Annual Symposium. - Dordrecht : Springer Publishing Company. ; , s. 181-195
  • Konferensbidrag (refereegranskat)abstract
    • MOSART project addresses two main challenges of prevailing architectures: (i) Theglobal interconnect and memory bottleneck due to a single, globally shared memorywith high access times and power consumption; (ii) The difficulties in programmingheterogeneous, multi-core platforms MOSART aims to overcome these through amulti-core architecture with distributed memory organization, a Network-on-Chip(NoC) communication backbone and configurable processing cores that are scaled,optimized and customized together to achieve diverse energy, performance, cost andsize requirements of different classes of applications. MOSART achieves this by:(i) Providing platform support for management of abstract data structures includingmiddleware services and a run-time data manager for NoC based communicationinfrastructure; (ii) Developing tool support for parallelizing and mapping applicationson the multi-core target platform and customizing the processing cores for theapplication.
  •  
5.
  • Chen, Xiaowen, et al. (författare)
  • Area and Performance Optimization of Barrier Synchronization on Multi-core Network-on-Chips
  • 2010
  • Ingår i: 3rd IEEE International Conference on Computer and Electrical Engineering (ICCEE).
  • Konferensbidrag (refereegranskat)abstract
    • Barrier synchronization is commonly and widelyused to synchronize the execution of parallel processor coreson multi-core Network-on-Chips (NoCs). Since its globalnature may cause heavy serialization resulting in largeperformance penalty, barrier synchronization should becarefully designed to have low latency communication and tominimize overall completion time. Therefore, in the paper, wepropose a fast barrier synchronization mechanism, targetingMulti-core NoCs. The fast barrier synchronization mechanismincludes a dedicated hardware module, named Fast BarrierSynchronizer (FBS), integrated with each processor node. Itoffers a set of barrier counters and can concurrently processsynchronization requests issued by the local node and remotenodes via the on-chip network. The salient feature of our fastbarrier synchronization mechanism is that, once the barriercondition is reached, the “barrier release” acknowledgement isrouted to all processor nodes in a broadcast way in order tosave chip area by avoiding storing source node informationand to minimize completion time by avoiding serialization ofbarrier releasing. Synthesis results suggest that the FBS canrun over 1 GHz in SMIC® 130nm technology with small areaoverhead. We implemented a FBS-enhanced multi-core NoCarchitecture on our FPGA platform using the Xilinx® Virtex 5as the FPGA chip. FPGA utilization and simulation resultsshow that our fast barrier synchronization demonstrates botharea and performance advantages over the barriersynchronization counterpart with unicast barrier releasing.
  •  
6.
  • Chen, Xiaowen, et al. (författare)
  • Cooperative communication based barrier synchronization in on-chip mesh architectures
  • 2011
  • Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 8:22, s. 1856-1862
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose cooperative communication as a means to enable efficient and scalable barrier synchronization on mesh-based many-core architectures. Our approach is different from but orthogonal to conventional algorithm-based optimizations. It relies on collaborating routers to provide efficient gather and multicast communication. In conjunction with a master-slave algorithm, it exploits the mesh regularity to achieve efficiency. The gather and multicast functions have been implemented in our router. Synthesis results suggest marginal area overhead. With synthetic and benchmark experiments, we show that our approach significantly reduces synchronization completion time and increases speedup.
  •  
7.
  • Chen, Xiaowen, et al. (författare)
  • Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs
  • 2014
  • Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 11:18, s. 20140542-
  • Tidskriftsartikel (refereegranskat)abstract
    • On many-core Network-on-Chips (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. Different from conventional algorithm-based approaches, the paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With the cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. The cooperative communication is implemented in our router at low cost. Through comparative experiments, our approach evidently exhibits high efficiency and good scalability.
  •  
8.
  • Chen, Xiaowen, et al. (författare)
  • Handling Shared Variable Synchronization in Multi-core Network-on-Chips with Distributed Memory
  • 2010
  • Ingår i: Proceedings. - 9781424466832 ; , s. 467-472
  • Konferensbidrag (refereegranskat)abstract
    • Parallelized shared variable applications running on multi-core Network-on-Chips(NoCs) require efficient support for synchronization, since communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. In this paper, we propose a dedicated hardware module forsynchronization management. This module is called Synchronization Handler (SH), integrated with each processor-memory node on the multi-core NoCs. It uses two physical buffers to concurrently process synchronization requests issued by the local processor and remote processors via the on-chip network. One salient feature is that the two physical buffers are dynamically allocated to form multiple virtual buffers (a virtual buffer is related to a shared synchronization variable) so as to improve the buffer utilization and alleviate the head-of-line blocking. Synthesis results suggest that the SH can run over 900 MHz in 130nm technology with small area overhead. To justify the SH-enhanced multicore NoCs, we employ synthetic workloads to evaluate synchronizationcost and buffer utilization, and run synchronization-intensive applications to investigate speedup. The results show that our approach is viable.
  •  
9.
  • Chen, Xiaowen, et al. (författare)
  • Hybrid distributed shared memory space in multi-core processors
  • 2011
  • Ingår i: Journal of Software. - : International Academy Publishing (IAP). - 1796-217X. ; 6:12 SPEC. ISSUE, s. 2369-2378
  • Tidskriftsartikel (refereegranskat)abstract
    • On multi-core processors, memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtualto- Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
  •  
10.
  • Chen, Xiaowen, et al. (författare)
  • Multi-FPGA Implementation of a Network-on-Chip Based Many-core Architecture with Fast Barrier Synchronization Mechanism
  • 2010
  • Ingår i: Proceedings of the IEEE Norchip Conference. - 9781424489732
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we propose a fast barrier synchronization mechanism, targetingNetwork-on-Chip based manycore architectures. Its salient feature is that, once thebarrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save area by avoiding storing source node information and to minimize completion time by eliminating serialization of barrierreleasing. Then, we construct a multi-FPGA platform using Xilinx® Virtex 5 as FPGA chipsand implement a NoC based many-core architecture on it. FPGA utilization and simulation results show that our mechanism demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing. 
  •  
11.
  • Chen, Xiaowen, et al. (författare)
  • Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property
  • 2013
  • Ingår i: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 39:2, s. 596-612
  • Tidskriftsartikel (refereegranskat)abstract
    • In Network-on-Chip (NoC) based multi-core platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs performance penalty due to the Virtual-to-Physical (V2P) address translation overhead for all memory accesses. Based on the data property which can be either private or shared, this paper proposes a hybrid DSM which partitions a local memory into a private and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for those data with changeable property when they become private at runtime. In the experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show performance improvement up to 37.89%.
  •  
12.
  • Chen, Xiaowen, 1982-, et al. (författare)
  • Run-time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips
  • 2010
  • Ingår i: The 3rd IEEE International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010). - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 39-46
  • Konferensbidrag (refereegranskat)abstract
    • On multi-core Network-on-Chips (NoCs), mem- ories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memoryaddresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. Thehybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressingon shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-timepartitioning of hybrid DSM organization in order to analyze its perfor- mance. A real DSM based multi-core NoC platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioningdemonstrates performance advantage over the conventional DSM counterpart. The percentage of performance improve- ment depends on problem size, way of datapartitioning and computation/ communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.
  •  
13.
  • Chen, Xiaowen, et al. (författare)
  • Supporting Distributed Shared Memory on Multi-core Network-on-Chips Using a Dual Microcoded Controller
  • 2010
  • Ingår i: Proceedings of the conference for Design Automation and Test in Europe. ; , s. 39-44
  • Konferensbidrag (refereegranskat)abstract
    • Supporting Distributed Shared Memory (DSM) is essential for multi-coreNetwork-on-Chips for the sake of reusing huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable where the DSM functions such as virtual-to-physical address translation,memory access and synchronization etc. are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, ourcontroller features two mini-processors, one dealing with requests from the local coreand the other from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and can run up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to hardware solutions on average but still have all the flexibility of software solutions.
  •  
14.
  • Chen, Xiaowen, 1982-, et al. (författare)
  • Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique
  • 2010
  • Ingår i: Proceedings of the IEEE Annual Symposium on VLSI. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 462-463
  • Konferensbidrag (refereegranskat)abstract
    • This paper explores a dynamic buffer allocation technique to guide a distributedsynchronization architecture to support efficient synchronization on multi-core Network-on-Chips (NoCs). The synchronization architecture features two physical buffers to be able to concurrently queue and handle synchronization requests issued by the local processor and remote processors via the on-chip network. Using the dynamic bufferallocation technique, the two physical buffers are dynamically allocated to form multiple virtual buffers in order to improve buffers' utilization. Experiments are carried on to evaluate buffers' utilization.
  •  
15.
  • Chen, Y., et al. (författare)
  • A deadlock-free fault-tolerant routing algorithm based on pseudo-receiving mechanism for networks-on-chip of CMP
  • 2011
  • Ingår i: 2011 International Conference on Multimedia Technology, ICMT 2011. - 9781612847719 ; , s. 2825-2828
  • Konferensbidrag (refereegranskat)abstract
    • As the size of CMOS technology scales down to nanometers domain, fault-tolerant is becoming a challenge for NoC. Turn model provides a simple and efficient systematic approach to the development of deadlock-free routing algorithms. In this paper, we propose a pseudo-receiving mechanism based on the support of local processor's cache to enable prohibited turn, and meanwhile make it keep deadlockfree. We present a fault-tolerant routing algorithm based on pseudo-receiving mechanism for 2D mesh. The routing algorithm is livelock-free in the cost of disable a few un-faulty links or nodes. The algorithm is applied to a single-cycle fixed output-buffered router. Experimental results show that, it achieves high performance even under high faulty rate.
  •  
16.
  • Chen, Yancang, et al. (författare)
  • A single-cycle output buffered router with layered switching for Networks-on-Chips
  • 2012
  • Ingår i: Computers & electrical engineering. - : Elsevier BV. - 0045-7906 .- 1879-0755. ; 38:4, s. 906-916
  • Tidskriftsartikel (refereegranskat)abstract
    • We present a single-cycle output buffered router based on layered switching for networks on chips (NoCs). Different from state-of-the-art NoC routers, the router has three important characteristics: (1) It employs layered switching, which implements wormhole on top of virtual cut-through (VCT) switching; (2) In contrast to input buffered architectures, it adopts an output buffered architecture; (3) It is single cycle, meaning that the router pipeline takes only one cycle for all flits. Experimental results show that the router achieves up to 80% of ideal network throughput under uniform random traffic pattern. Compared with wormhole switching, layered switching achieves up to 36.9% latency reduction for 12-flit packets under uniform random traffic with an injection rate of 0.5 flit/cycle/node. Under 65 nm technology synthesized results show that its critical path has only 20 logic gates, and it reduces 11% area compared to the input virtual-channel router with the same buffer capacity.
  •  
17.
  • Chen, Yancang, et al. (författare)
  • A Trace-driven Hardware-level Simulator for Design and Verification of Network-on-Chips
  • 2010
  • Ingår i: 2011 INTERNATIONAL CONFERENCE ON COMPUTERS, COMMUNICATIONS, CONTROL AND AUTOMATION (CCCA 2011), VOL II. - : IEEE. ; , s. 32-35
  • Konferensbidrag (refereegranskat)abstract
    • Traditional communications of general-purpose multi-core processor and application-specific System-on-Chip face challenges in terms of scalability and complexity. Network-on-Chip (NoC) has been the most promising solution for the communications of multi-core and many-core chips. In this paper, we present a trace-driven hardware-level simulator (noted HS) based on SystemVerilog for the design and verification of NoCs. Different from the state-of-the-art NoC simulators, the HS owns three important characteristics in addition to the capability of creating simulation and synthesizable NoC descriptions: 1) hardware-level simulation can be done, which means more implementation details of hardware than flit-level simulation; 2) router debugging and verification can be done at RTL by inserting assertions and coverage; 3) trace-based application simulations can be done besides synthetic workloads. A 4 X 4 2D mesh NoC with output virtual-channel routers verifies the capability of our HS.
  •  
18.
  • Chen, Y., et al. (författare)
  • Slice router : For fine-granularity fault-tolerant Networks-on-Chip
  • 2011
  • Ingår i: 2011 International Conference on Multimedia Technology, ICMT 2011. - 9781612847740 ; , s. 3230-3233
  • Konferensbidrag (refereegranskat)abstract
    • Almost all existing Networks-on-Chip (NoC) faulttolerant schemes are based on fault-tolerant routing algorithms. In these fault-tolerant schemes, faulty links or routers will be discarded all together. However, only a few part of the discarded link or router is faulty in most cases. It is wasteful to discard the whole link or router. In this paper, we present a slice router architecture which can be used in fine-granularity fault-tolerant NoC. The major motivation of presenting slice router is to refine faulty links and routers. The major idea is that a router is split into several sub-link routers, noted slices. Different from several physically independent routers, slices are coupled together in input/output ports. The coupling of slices makes the network to be able to fine-granularity fault-tolerant. In order to evaluate the fault-tolerant capability of slice routers, we design a looselycoupled 4-slices router with a backup sub-link in each link. Each slice is a single-cycle output buffered switch. Simulation results prove its fault-tolerant capability in the present of high faulty rates. The critical latency is only increased 0.04ns, because the configuration of slice interfaces is parallel with the output arbiter of slices. Under 65nm technology synthesized results show that, the increased area overhead of a slice router is only a few logic gates compared with the non-coupled slice router.
  •  
19.
  • Du, G., et al. (författare)
  • An analytical model for worst-case reorder buffer size of multi-path minimal routing NoCs
  • 2014
  • Ingår i: Proceedings - 2014 8th IEEE/ACM International Symposium on Networks-on-Chip, NoCS 2014. - : IEEE. - 9781479953479 ; , s. 49-56
  • Konferensbidrag (refereegranskat)abstract
    • Reorder buffers are often needed in multi-path routing networks-on-chips (NoCs) to guarantee in-order packet delivery. However, the buffer sizes are usually over-dimensioned, due to lack of worst-case analysis, leading to unnecessary larger area overhead. Based on network calculus, we propose an analysis framework for the worst-case reorder buffer size in multi-path minimal routing NoCs. Experiments with synthetic traffic and an industry case show that our method can effectively explore the traffic splitting space, as well as the mapping effects in terms of reorder buffer size with a maximum improvement of 36.50%.
  •  
20.
  • Du, G., et al. (författare)
  • Worst-case performance analysis of 2-D mesh NoCs using multi-path minimal routing
  • 2012
  • Ingår i: CODES+ISSS'12 - Proceedings of the 10th ACM International Conference on Hardware/Software-Codesign and System Synthesis, Co-located with ESWEEK. - New York, NY, USA : ACM Publications. - 9781450314268 ; , s. 123-132
  • Konferensbidrag (refereegranskat)abstract
    • In Network-on-Chip (NoC), multi-path routing is often preferable than single-path routing since it can better balance workload and thus provide better performance. However, performance analysis with multi-path routing is much more difficult due to complicated contention scenarios. Based on network calculus, we study worst-case performance of deterministic multi-path minimal routing on 2-D mesh NoCs. We first present a per-flow delay bound analysis technique for multi-path routing, which extends the analysis for singlepath routing but deals with traffic splitting. Then we define a contention matrix to capture network congestion status. Based on the contention matrix, we propose an effective nonuniform traffic splitting strategy to improve worst-case performance. Experiments with synthetic traffic flows and an industrial case show that our analysis can effectively explore the traffic splitting space, and verify the effectiveness of the non-uniform splitting policy.
  •  
21.
  • Eslami Kiasari, Abbas, et al. (författare)
  • A Framework for Designing Congestion-Aware Deterministic Routing
  • 2010
  • Ingår i: NoCArc '10 Proceedings of the Third International Workshop on Network on Chip Architectures. - New York, NY, USA : ACM. - 9781450303972 ; , s. 45-50
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we present a system-level Congestion-Aware Routing (CAR) framework for designing minimal deterministic routing algorithms. CAR exploits the peculiarities of the application workload to spread the load evenly across the network. To this end, we first formulate an optimization problem of minimizing the level of congestion in the network and then use the simulated annealing heuristic to solve this problem. The proposed framework assures deadlock-free routing, even in the networks without virtual channels. Experiments with both synthetic and realistic workloads show the effectiveness of the CAR framework. Results show that maximum sustainable throughput of the network is improved by up to 205% for different applications and architectures.
  •  
22.
  • Eslami Kiasari, Abbas, et al. (författare)
  • A Heuristic Framework for Designing and Exploring Deterministic Routing Algorithm for NoCs
  • 2013
  • Ingår i: Algorithms in Networks-on-Chip. - New York, NY : Springer. - 9781461482734 ; , s. 21-39
  • Bokkapitel (refereegranskat)abstract
    • In this chapter, we present a system-level framework for designing minimal deterministic routing algorithms for Networks-on-Chip (NoCs) that are customized for a set of applications. To this end, we first formulate an optimization problem of minimizing average packet latency in the network and then use the simulated annealing heuristic to solve this problem. To estimate the average packet latency we use a queueing-based analytical model which can capture the burstiness of the traffic. The proposed framework does not require virtual channels to guarantee deadlock freedom since routes are extracted from an acyclic channel dependency graph. Experiments with both synthetic and realistic workloads show the effectiveness of the approach. Results show that maximum sustainable throughput of the network is improved for different applications and architectures.
  •  
23.
  • Eslami Kiasari, Abbas, et al. (författare)
  • An Analytical Latency Model for Networks-on-Chip
  • 2013
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - 1063-8210 .- 1557-9999. ; 21:1, s. 113-123
  • Tidskriftsartikel (refereegranskat)abstract
    • We propose an analytical model based on queueing theory for delay analysis in a wormhole-switched network-on-chip (NoC). The proposed model takes as input an application communication graph, a topology graph, a mapping vector, and a routing matrix, and estimates average packet latency and router blocking time. It works for arbitrary network topology with deterministic routing under arbitrary traffic patterns. This model can estimate per-flow average latency accurately and quickly, thus enabling fast design space exploration of various design parameters in NoC designs. Experimental results show that the proposed analytical model can predict the average packet latency more than four orders of magnitude faster than an accurate simulation, while the computation error is less than 10% in non-saturated networks for different system-on-chip platforms.
  •  
24.
  • Eslami Kiasari, Abbas, et al. (författare)
  • Analytical approaches for performance evaluation of networks-on-chip
  • 2012
  • Ingår i: CASES'12 - Proceedings of the 2012 ACM International Conference on Compilers, Architectures and Synthesis for Embedded Systems, Co-located with ESWEEK. - New York, NY, USA : ACM. - 9781450314244 ; , s. 211-212
  • Konferensbidrag (refereegranskat)abstract
    • This tutorial reviews four popular mathematical formalisms - dataflow analysis, schedulability analysis, network calculus, and queueing theory - and how they have been applied to the analysis of Network-on-Chip (NoC) performance. We review the basic concepts and results of each formalism and provide examples of how they have been used in on-chip communication performance analysis. The tutorial also discusses the respective strengths and weaknesses of each formalism, their suitability for a specific purpose, and the attempts that have been made to bridge these analytical approaches. Finally, we conclude the tutorial by discussing open research issues.
  •  
25.
  • Eslami Kiasari, Abbas, et al. (författare)
  • Mathematical formalisms for performance evaluation of networks-on-chip
  • 2013
  • Ingår i: ACM Computing Surveys. - : Association for Computing Machinery (ACM). - 0360-0300 .- 1557-7341. ; 45:3, s. 38-
  • Tidskriftsartikel (refereegranskat)abstract
    • This article reviews four popular mathematical formalisms-queueing theory, network calculus, schedulability analysis, anddataflow analysis-and how they have been applied to the analysis of on-chip communication performance in Systems-on-Chip. The article discusses the basic concepts and results of each formalism and provides examples of how they have been used in Networks-on-Chip (NoCs) performance analysis. Also, the respective strengths and weaknesses of each technique and its suitability for a specific purpose are investigated. An open research issue is a unified analytical model for a comprehensive performance evaluation of NoCs. To this end, this article reviews the attempts that have been made to bridge these formalisms.
  •  
26.
  • Feng, Chaochao, et al. (författare)
  • A 1-Cycle 1.25 GHz Bufferless Router for 3D Network-on-Chip
  • 2012
  • Ingår i: IEICE transactions on information and systems. - 0916-8532 .- 1745-1361. ; E95D:5, s. 1519-1522
  • Tidskriftsartikel (refereegranskat)abstract
    • In this paper, we propose a 1-cycle high-performance 3D bufferless router with a 3-stage permutation network. The proposed router utilizes the 3-stage permutation network instead of the serialized switch allocator and 7 x 7 crossbar to achieve the frequency of 1.25 GHz in TSMC 65 nm technology. Compared with the other two 3D bufferless routers, the proposed router occupies less area and consumes less power consumption. Simulation results under both synthetic and application workloads illustrate that the proposed router achieves less average packet latency than the other two 3D bufferless routers.
  •  
27.
  •  
28.
  •  
29.
  • Feng, C., et al. (författare)
  • Addressing transient and permanent faults in NoC with efficient fault-tolerant deflection router
  • 2013
  • Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - 1063-8210 .- 1557-9999. ; 21:6, s. 1053-1066
  • Tidskriftsartikel (refereegranskat)abstract
    • Continuing decrease in the feature size of integrated circuits leads to increases in susceptibility to transient and permanent faults. This paper proposes a fault-tolerant solution for a bufferless network-on-chip, including an on-line fault-diagnosis mechanism to detect both transient and permanent faults, a hybrid automatic repeat request, and forward error correction link-level error control scheme to handle transient faults and a reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to tolerate permanent faults without deadlock and livelock. A hierarchical-routing-table-based algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR router. Synthesized results show that, compared with the FTDR router, the FTDR-H router can reduce the area by 27% in an 8×8 network. Simulation results demonstrate that under synthetic workloads, in the presence of permanent link faults, the throughput of an 8×8 network with FTDR and FTDR-H algorithms are 14% and 23% higher on average than that with the fault-on-neighbor (FoN) aware deflection routing algorithm and the cost-based deflection routing algorithm, respectively. Under real application workloads, the FTDR-H algorithm achieves 20% less hop counts on average than that of the FoN algorithm. For transient faults, the performance of the FTDR router can achieve graceful degradation even at a high fault rate. We also implement the fault-tolerant deflection router which can achieve 400 MHz in TSMC 65-nm technology.
  •  
30.
  • Feng, C. -C, et al. (författare)
  • A 1-cycle 2 GHz bufferless router for network-on-chip
  • 2011
  • Ingår i: Guofang Keji Daxue Xuebao/Journal of National University of Defense Technology. - 1001-2486. ; 33:6, s. 42-47
  • Tidskriftsartikel (refereegranskat)abstract
    • Recently, bufferless router, which does not need buffers, has become a low-cost solution for Network-on-Chip. To improve the performance of the bufferless router, a 1-cycle high-performance bufferless router was proposed for Network-on-Chip. The router used a simple permutation network instead of the serialized switch allocator and the crossbar to achieve high performance. Compared with the virtual channel router and the baseline bufferless router, the proposed bufferless router can achieve the frequency of 2 GHz with small area cost under TSMC 65 nm technology. Simulation results under both synthetic and application workloads demonstrate that the proposed bufferless router achieves much less average packet latency than the virtual channel router and other bufferless routers.
  •  
31.
  • Feng, Chaochao, et al. (författare)
  • Evaluation of Deflection Routing on Various NoC Topologies
  • 2011
  • Ingår i: Proceedings of the IEEE International Conference on ASIC (ASICON).
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, we propose two novel deflection routing algorithms for de Bruijn and Spidergon NoCs and evaluate the performance of the deflection routing on 5 NoC topologies with different synthetic traffic patterns. We also synthesize the routers in various NoC topologies with TSMC 65nm technology. The evaluation results illustrate that the performance of deflection routing is susceptible to the network topology and traffic pattern. The results can also guide the NoC architect to choose the suitable NoC topology for the specific application.
  •  
32.
  • Feng, Chaochao, et al. (författare)
  • FoN : Fault-on-Neighbor aware Routing Algorithm for Networks-on-Chip
  • 2010
  • Ingår i: Proceedings - IEEE International SOC Conference, SOCC 2010. - 9781424466832 ; , s. 441-446
  • Konferensbidrag (refereegranskat)abstract
    • Reliability has become a key issue of Networks-on-Chip (NoC) as the CMOS technology scales down to the nanoscale domain. This paper proposes a Fault-on-Neighbor (FoN) aware deflection routing algorithm for NoC which makes routing decision based on the link status of neighbor switches within 2 hops to avoid fault links and switches. Simulation results demonstrate that in the presence of faults, the saturated throughput of the FoN switch is 13% higher on average than a cost-based deflection switch for 88 mesh. The average hop counts can be up to 1.7 less than the cost-based switch. The FoN switch is also synthesized using 65nm TSMC technology and it can work at 500MHz with small area overhead.
  •  
33.
  • Feng, Chaochao, et al. (författare)
  • Support Efficient and Fault-Tolerant Multicast in Bufferless Network-on-Chip
  • 2012
  • Ingår i: IEICE transactions on information and systems. - 0916-8532 .- 1745-1361. ; E95D:4, s. 1052-1061
  • Tidskriftsartikel (refereegranskat)abstract
    • In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packets replication (DRM_noPR) sends multicast packet through a non-deterministic path. The DRM schemes with adaptive packets replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or intermediate node according to the destination position and the state of output ports to reduce the average multicast latency. We also provide fault-tolerant supporting in these schemes through a reinforcement-learning-based method to reconfigure the routing table to tolerate permanent faulty links in the network. Simulation results illustrate that the DRM_PR_all scheme achieves 41%, 43% and 37% less latency on average than that of the DRM_noPR scheme and 27%, 29% and 25% less latency on average than that of the DRM_PR_src scheme under three synthetic traffic patterns respectively. In addition, all three fault-tolerant DRM schemes achieve acceptable performance degradation at various link fault rates without any packet lost.
  •  
34.
  • Hu, Wenmin, et al. (författare)
  • A flexible configuration approach for fault-tolerant multicast/unicast
  • 2011
  • Ingår i: IEEE Int. Conf. Commun. Softw. Networks, ICCSN. - 9781612844855 ; , s. 393-396
  • Konferensbidrag (refereegranskat)abstract
    • A flexible approach for lookup table configuration is proposed. In this scheme, a predetermined path is setup in parallel by several unicastsetup packets. Compared with other approaches, our scheme eliminates the overhead of configuration bus by adding little logic to existing multicast router based on lookup table. This extension makes any-shaped path setup possible, which benefits fault-tolerance on Network-on-Chip (NoC).
  •  
35.
  • Hu, Wenmin, et al. (författare)
  • Multicast Path Setup Incorporating Evicting
  • 2012
  • Ingår i: Elektronika ir Elektrotechnika. - : Kaunas University of Technology (KTU). - 1392-1215. ; :8, s. 101-104
  • Tidskriftsartikel (refereegranskat)abstract
    • In this paper, we propose a novel multicast path setup scheme, which incorporates the evicting process. Compared with the previous work, our scheme either overcomes the limitation in evicting times or reduces the setup latency.
  •  
36.
  • Hu, Wenmin, et al. (författare)
  • Network-on-Chip Multicasting with Low Latency Path Setup
  • 2011
  • Ingår i: Proceedings of the VLSI-SoC Conference.
  • Konferensbidrag (refereegranskat)abstract
    • A low-latency path setup approach with multiple setup packets for parallel set is presented. It reduces the header overhead compared to multiaddress encoding. Further, we propose four variants of deadlock-free multicast routing algorithms using different subpath generation methods, different destination partitioning, and channel sharing strategies. Experimental results show that the quatuor partitions path-like tree outperforms other algorithms.
  •  
37.
  • Hu, Wenmin, et al. (författare)
  • Power-efficient Tree-based Multicast Support for Networks-on-Chip
  • 2011
  • Ingår i: Proceedings of the Asian Pacific Design Automation Conference (ASPDAC). - : IEEE. - 9781424475162 ; , s. 363-368
  • Konferensbidrag (refereegranskat)abstract
    • In this paper, a novel hardware support for multicast on mesh Networks-on-Chip (NoC) is proposed. It supports multicast routing on any shape of tree-based paths. Two power-efficient tree-based multicast routing algorithms, Optimized tree (OPT) and Left-XY-Right-Optimized tree (LXYROPT) are also proposed. XY tree-based (XYT) algorithm and multiple unicast copies (MUC) are also implemented on the router as baselines. Along with the increase of the destination size, compared with MUC, OPT and LXYROPT achieve a remarkable improvement in both latency and throughput while the average power consumption is reduced by 50% and 45%, respectively. Compared with XYT, OPT is 10% higher in latency but gains 17% saving in power consumption. LXYROPT is 3% lower in latency and 8% lower in power consumption. In some cases, OPT and LXYROPT give power saving up to 70% less than the XYT.
  •  
38.
  • Hu, Wenmin, et al. (författare)
  • Self-selection pseudo-circuit : a clever crossbar pre- allocation
  • 2012
  • Ingår i: IEICE Electronics Express. - : Institute of Electronics, Information and Communications Engineers (IEICE). - 1349-2543. ; 9:6, s. 558-564
  • Tidskriftsartikel (refereegranskat)abstract
    • This paper proposes self-selection pseudo- circuit (SP), a simple and effective approach to increase switch connection reusing rate and improve the network performance. It especially suits the network in which the performance is dominated by the number of hops. In SP scheme, multiple switch connections are allowed to be reserved for one inport, and the flit can reuse the partial switch connection(s) based on the routing information. For the evaluation with the traces from Splash-2, SP reduces the interconnection latency by up to 21.6% (16.9% average) with 16-core CMP configuration, and 22.2% ( 19.5 on average) with 64- core CMP configuration. Evaluated with synthetic traffic, the proposed scheme decreases the latency up to 19% ( 16% average).
  •  
39.
  • Hu, Wenmin, et al. (författare)
  • TPSS : A flexible hardware support for unicast and multicast on networks-on-chip
  • 2012
  • Ingår i: Journal of Computers. - : International Academy Publishing (IAP). - 1796-203X. ; 7:7, s. 1743-1752
  • Tidskriftsartikel (refereegranskat)abstract
    • Multicast is an important traffic mode that runs on multi-core systems, and an efficient hardware support for multicast can greatly improve the performance of the whole system. Most multicast solutions use the dimension-order routing to generate the mutlicast trees, which are neither bandwidth nor power efficient. This article presents a synthesizable router for network-on-chip (NoC) which supports arbitrarily shaped multicast path based on a mesh topology. In our scheme, incremental setup is adopted to simplify the process of multicast tree construction. For each sub-path setup, we present a novel scheme called two period sub-path setup (TPSS). TPSS is divided into two periods: routing to a predeterminate intermediate router, and updating lookup tables from the intermediate router to destination. This novel setup makes it feasible to support arbitrarily shaped path setup. In our case study, Optimized tree algorithm (OPT) and Left-XY-Right-Optimized tree algorithm (LXYROPT) are proposed for power-efficient path searching, but they need to be pre-configured for the reason of high computation cost. Moreover, Virtual Circuit Tree Multicasting (VCTM) is also supported in our scheme for dynamic construction of multicast path, which needs no computation in path searching. The performance is evaluated by using a cycle accurate simulator developed in SystemC, and the hardware overhead is estimated by using a synthesizable HDL model. Compared to VCTM (without FIFO, multicast table and network adapter), the area overhead of implementing our router is negligible (less than 0.5%).
  •  
40.
  • Jafari, Fahimeh, et al. (författare)
  • Buffer Optimization in Network-on-Chip Through Flow Regulation
  • 2010
  • Ingår i: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. - 0278-0070 .- 1937-4151. ; 29:12, s. 1973-1986
  • Tidskriftsartikel (refereegranskat)abstract
    • For network-on-chip (NoC) designs, optimizing buffers is an essential task since buffers are a major source of cost and power consumption. This paper proposes flow regulation and has defined a regulation spectrum as a means for system-on-chip architects to control delay and backlog bounds. The regulation is performed per flow for its peak rate and burstiness. However, many flows may have conflicting regulation requirements due to interferences with each other. Based on the regulation spectrum, this paper optimizes the regulation parameters aiming for buffer optimization. Three timing-constrained buffer optimization problems are formulated, namely, buffer size minimization, buffer variance minimization, and multiobjective optimization, which has both buffer size and variance as minimization objectives. Minimizing buffer variance is also important because it affects the modularity of routers and network interfaces. A realistic case study exhibits 62.8% reduction of total buffers, 84.3% reduction of total latency, and 94.4% reduction on the sum of variances of buffers. Likewise, the experimental results demonstrate similar improvements in the case of synthetic traffic patterns. The optimization algorithm has low run-time complexity, enabling quick exploration of large design spaces. This paper concludes that optimal flow regulation can be a highly valuable instrument for buffer optimization in NoC designs.
  •  
41.
  • Jafari, Fahimeh, et al. (författare)
  • Optimal Regulation of Traffic Flows in Networks-on-Chip
  • 2010
  • Ingår i: Proceedings of the Design Automation and Test Europe Conference (DATE). - : IEEE Computer Society. - 9783981080162 ; , s. 1621-1624
  • Konferensbidrag (refereegranskat)abstract
    • We have proposed (σ, ρ)-based flow regulation to reduce delay and backlog bounds in SoC architectures, where σ bounds the traffic burstiness and ρ the traffic rate. The regulation is conducted per-flow for its peak rate and traffic burstiness. In this paper, we optimize these regulation parameters in networks on chips where many flows may have conflicting regulation requirements. We formulate an optimization problem for minimizing total buffers under performance constraints. We solve the problem with the interior point method. Our case study results exhibit 48% reduction of total buffers and 16% reduction of total latency for the proposed problem. The optimization solution has low run-time complexity, enabling quick exploration of large design space.
  •  
42.
  • Jafari, Fahimeh, et al. (författare)
  • Output Process of Variable Bit-Rate Flows in On-Chip Networks Based on Aggregate Scheduling
  • 2011
  • Ingår i: Proceedings of the International Conference on Computer Design. - 9781457719523 ; , s. 445-446
  • Konferensbidrag (refereegranskat)abstract
    •  In NoCs often several flows are merged into one aggregate flow due to heavy resource sharing. For strengthening formal performance analysis, we propose an improved model for an output flow of a FIFO multiplexer under aggregate scheduling. The model of the aggregate flow is formally proven and can serve as the basis for a stringent worst case delay and buffer analysis.
  •  
43.
  • Jafari, Fahimeh, et al. (författare)
  • Worst-Case Delay Analysis of Variable Bit-Rate Flows in Network-on-Chip with Aggregate Scheduling
  • 2012
  • Ingår i: Proceedings of the Design and Test in Europe Conference (DATE). - : IEEE. - 9783981080186 ; , s. 538-541
  • Konferensbidrag (refereegranskat)abstract
    • Aggregate scheduling in routers merges several flows into one aggregate flow. We propose an approach for computing the end-to-end delay bound of individual flows in a FIFO multiplexer under aggregate scheduling. A synthetic case study exhibits that the end-to-end delay bound is up to 33.6% tighter than the case without considering the traffic peak behavior.
  •  
44.
  • Jantsch, Axel, et al. (författare)
  • Memory Architecture and Management in an NoC Platform
  • 2011. - 1
  • Ingår i: Scalable Multi-core Architectures. - New York, NY : Springer. - 9781441967770 ; , s. 3-28
  • Bokkapitel (refereegranskat)abstract
    • The memory organization and the management of the memory space is a critical part of every NoC based platform design. We propose a Data Management Engine (DME), that is a block of programmable hardware and part of every processing element. It off-loads the processing element (CPU, DSP, etc.) by managing the memory space, memory access and the communication over the on-chip network. The DME’s main functions are virtual address translation, private and shared memory management, cache coherence protocol, support for memory consistency models, synchronization and protection mechanisms for shared memory communication. The DME is fully programmable and configurable thus allowing for customized support for high level data management functions such as dynamic memory allocation and abstract data types. This chapter describes the main concepts, design and functionality of the DME and presents case studies illustrating its usage and performance.
  •  
45.
  • Jiang, X., et al. (författare)
  • An enhanced iot gateway in a broadcast system
  • 2012
  • Ingår i: Proceedings - IEEE 9th International Conference on Ubiquitous Intelligence and Computing and IEEE 9th International Conference on Autonomic and Trusted Computing, UIC-ATC 2012. - : IEEE. - 9780769548432 ; , s. 746-751
  • Konferensbidrag (refereegranskat)abstract
    • Internet of things (IOT) technology increasingly affects backbone network development. Aiming to reduce or weaken such effects on backbone networks from a traffic perspective and attain no discount on functions in a CobraNet based digital broadcast system, an enhanced IOT gateway and its evaluation system are presented. Based on the prototype evaluation system, experiments demonstrate the IOT gateway not only inherits the conventional functions previously often carried out by a PC platform or a server, but also develops additional abilities to cut down traffic under some cases and provides portable management for the system at the same time. According to the experimental results, it is possible from a traffic perspective to design an enhanced IOT gateway to provide value added service such as file transfer, message function, etc. in addition to high quality audio service by taking advantage of the verified sub-linear information addition mechanism. Such enhanced gateway can be properly adapted to other kinds of applications according to the nondemanding system requirements of the prototype evaluation system and proposed IOTGW design.
  •  
46.
  • Jiang, X., et al. (författare)
  • Lessons of IOT effects on backbone networks learnt from traffic characteristics
  • 2012
  • Ingår i: Int. Conf. Wirel. Commun., Networking Mob. Comput., WiCOM. - 9781612846835
  • Konferensbidrag (refereegranskat)abstract
    • Emerging and blooming internet of things (IOT)technology greatly affects backbone network development. The effects are not accurately tested or predicted yet. As a trial, some traffic characteristics profiling cases are investigated based on a gateway in IOT (IOTGW), a classical device, for data communication between IOT and backbone networks. From this investigation, some lessons of IOT effects on backbone networks are learnt by taking application, network size, data volume, etc. into account. To our knowledge, this study is the first trial to investigate the IOT effects on backbone networks from the traffic perspective.
  •  
47.
  • Liu, Ming, et al. (författare)
  • A High-End Reconfigurable Computation Platform for Nuclear and Particle Physics Experiments
  • 2011
  • Ingår i: Computing in science & engineering (Print). - 1521-9615 .- 1558-366X. ; 13:2, s. 52-63
  • Tidskriftsartikel (refereegranskat)abstract
    • A high-performance computation platform based on field-programmable gate arrays targets nuclear and particle physics experiment applications. The system can be constructed or scaled into a supercomputer-equivalent size for detector data processing by inserting compute nodes into advanced telecommunications computing architecture (ATCA) crates. Among the case study results are that one ATCA crate can provide a computation capability equivalent to hundreds of commodity PCs for Hades online particle track reconstruction and Cherenkov ring recognition.
  •  
48.
  • Liu, M., et al. (författare)
  • A survey of FPGA dynamic reconfiguration design methodology and applications
  • 2012
  • Ingår i: International Journal of Embedded and Real-Time Communication Systems. - : IGI Global. - 1947-3176 .- 1947-3184. ; 3:2, s. 23-39
  • Forskningsöversikt (refereegranskat)abstract
    • FPGA Dynamic Partial Reconfiguration (DPR or PR) technology has emerged and become gradually mature in the recent years. It provides the Time-Division Multiplexing (TDM) capability in utilizing on-chip resources and leads to significant benefits in comparison with conventional static designs. However, the partially reconfigurable design process features additional complexity and technical requirements to the FPGA developers. Hence, PR design approaches are being widely explored and investigated to systematize the development methodology and ease the designers. In this paper, the authors collect several research and engineering projects in this area and present a survey of the design methodology and applications of PR. Research aspects are discussed in various hardware/software layers.
  •  
49.
  • Liu, Ming, 1982- (författare)
  • Adaptive Computing based on FPGA Run-time Reconfigurability
  • 2011
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • In the past two decades, FPGA has been witnessed from its restricted use as glue logic towards real System-on-Chip (SoC) platforms. Profiting from the great development on semiconductor and IC technologies, the programmability of FPGAs enables themselves wide adoption in all kinds of aspects of embedded designs. Modern FPGAs provide the additional capability of being dynamically and partially reconfigured during the system run-time. The run-time reconfigurability enhances FPGA designs from the sole spatial to both spatial and temporal parallelism, providing more design flexibility for advanced system features. Adaptive computing delegates an advanced computing paradigm in which computation tasks and resources are intelligently managed in correspondence with conditional requirements. In this thesis, we investigate adaptive designs on FPGA platforms: We present a comprehensive and practical design framework for adaptive computing based on the FPGA run-time reconfigurability. It concerns several design key issues in different hardware/software layers, specifically hardware architecture, run-time reconfiguration technical support, OS and device drivers, hardware process scheduler, context switching as well as Inter-Process Communications (IPC). Targeting a special application of data acquisition (DAQ) and trigger systems in nuclear and particle physics experiments, we set up the data streaming model and conduct theoretical analysis on the adaptive system. Three application studies are employed to verify the proposed adaptive design framework: The first application demonstrates a peripheral controller adaptable system aiming at general embedded designs. Through dynamically loading/unloading a NOR flash memory controller and an SRAM controller, both flash memory and SRAM accesses may be accomplished with less resource consumption than in traditional static designs. In the second case, two real algorithm processing engines are adaptively time-multiplexed in the same reconfigurable slot for particle recognition computation. Experimental results reveal the reduced on-chip resource requirements, as well as an approximate processing capability of the peer static design. Taking advantage of the FPGA dynamic reconfigurability, we present in the third application a novel on-FPGA interconnection microarchitecture named RouterLess NoC (RL-NoC). RL-NoC employs the novel design concept of Move Logic Not Data (MLND), and significantly distinguishes itself from the existing interconnection architectures such as buses, crossbars or NoCs. It does not rely on routers to deliver packets hop by hop as canonical NoCs do, but buffers data packets in virtual channels and brings various nodes using run-time reconfiguration to produce or consume them. In comparison with canonical packet-switching NoCs, the routerless architecture features lower design complexity, less resource consumption, higher work frequency, more efficient power dissipation as well as comparable or even higher packet delivery efficiency. It is regarded as a promising interconnection approach in some design scenarios on FPGAs, especially for light-weight applications.
  •  
50.
  •  
Skapa referenser, mejla, bekava och länka
  • Resultat 1-50 av 93

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy