SwePub - sökning: (db:Swepub) pers:(Jantsch Axe...

Numrering	Referens	Omslagsbild	Hitta
1.	Chen, Xiaowen, et al. (författare) Speedup Analysis of Data-parallel Applications on Multi-core NoCs 2009 Ingår i: Proceedings of the IEEE International Conference on ASIC (ASICON). - 9781424438686 ; , s. 105-108 Konferensbidrag (refereegranskat)abstract As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on Multi-core Network-on-Chips (NoCs). For data-parallel applications, we study the model of parallel speedup by including network communication latency in Amdahl's law. The speedup analysis considers the effect of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. In our Multi-core NoC platform, a real data-parallel application, i.e. matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and the speedup efficiency decreases as the system size is scaled up. Such analysis can be used to guide architects and programmers to improve parallel processing efficiency by reducing network latency with optimized network design and increasing computation proportion in the program.
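The idea of folding communication latency into Amdahl's law can be illustrated with a small sketch. The abstract does not give the paper's actual model, so the form below is an assumption: communication time is lumped into a single hypothetical parameter, `comm_fraction`, expressed as a fraction of the single-core runtime. The function and parameter names are illustrative only.

```python
def speedup(n_cores: int, parallel_fraction: float, comm_fraction: float = 0.0) -> float:
    """Amdahl-style speedup with an extra communication term.

    parallel_fraction: share of single-core runtime that parallelizes.
    comm_fraction: hypothetical communication overhead, as a share of
        single-core runtime (an assumption, not the paper's model).
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # Normalized parallel runtime: serial part + divided parallel part + communication.
    runtime = serial_fraction + parallel_fraction / n_cores + comm_fraction
    return 1.0 / runtime


def efficiency(n_cores: int, parallel_fraction: float, comm_fraction: float = 0.0) -> float:
    """Speedup per core; it falls as n_cores grows whenever runtime has a fixed part."""
    return speedup(n_cores, parallel_fraction, comm_fraction) / n_cores


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, round(speedup(n, 0.9, 0.02), 3), round(efficiency(n, 0.9, 0.02), 3))
```

With a fixed `comm_fraction` the sketch reproduces the qualitative trend the abstract reports: speedup grows nonlinearly and efficiency drops as the core count increases. A model in which communication cost depends on network size or topology would replace the constant with a function of `n_cores`.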
2.	Grange, Matt, et al. (författare) Physical mapping and performance study of a multi-clock 3-Dimensional Network-on-Chip mesh 2009 Ingår i: 2009 IEEE INTERNATIONAL CONFERENCE ON 3D SYSTEMS INTEGRATION. - San Francisco : IEEE conference proceedings. - 9781424445110 ; , s. 345-351 Konferensbidrag (refereegranskat)abstract The physical performance of a 3-Dimensional Network-on-Chip (NoC) mesh architecture employing through silicon vias (TSV) for vertical connectivity is investigated with a cycle-accurate RTL simulator. The physical latency and area impact of TSVs, switches, and the on-chip interconnect is evaluated to extract the maximum signaling speeds through the switches, horizontal and vertical network links. The relatively low parasitics of TSVs compared to the on-chip 2-D interconnect allow for higher signaling speeds between chip layers. The system-level impact on overall network performance as a result of clocking vertical packets at a higher rate through the TSV interconnect is simulated and reported.
3.	Jantsch, Axel, et al. (författare) Resource Allocation for QoS On-Chip Communication 2009 Ingår i: Networks-on-Chips: Theory and Practice. - : CRC Press. - 9781420079784 Bokkapitel (refereegranskat)
4.	Liu, Ming, 1982- (författare) A High-end Reconfigurable Computation Platform for Particle Physics Experiments 2008 Licentiatavhandling (övrigt vetenskapligt/konstnärligt)abstract Modern nuclear and particle physics experiments run at a very high reaction rate and are able to deliver a data rate of up to a hundred GBytes/s. This data rate is far beyond the storage and on-line analysis capability. Fortunately, physicists are interested in only a very small proportion of this huge amount of data. Therefore, in order to select the interesting data and reject the background by sophisticated pattern recognition processing, it is essential to realize an efficient data acquisition and trigger system which reduces the data rate by several orders of magnitude. Motivated by the requirements from multiple experiment applications, we are developing a high-end reconfigurable computation platform for data acquisition and triggering. The system consists of a scalable number of compute nodes, which are fully interconnected by high-speed communication channels. Each compute node features 5 Xilinx Virtex-4 FX60 FPGAs and up to 10 GBytes DDR2 memory. A hardware/software co-design approach is proposed to develop custom applications on the platform, partitioning performance-critical calculation to the FPGA hardware fabric while leaving flexible and slow controls to the embedded CPU plus the operating system. The system is expected to be high-performance and general-purpose for various applications, especially in the physics experiment domain. As a case study, the particle track reconstruction algorithm for HADES has been developed and implemented on the computation platform in the form of processing engines. The Tracking Processing Unit (TPU) recognizes peak bins on the projection plane and reconstructs particle tracks in real time.
Implementation results demonstrate its acceptable resource utilization and the feasibility of implementing the module together with the system design on the FPGA. Experimental results show that the online track reconstruction computation achieves 10.8 - 24.3 times performance acceleration per TPU module when compared to the software solution on a Xeon 2.4 GHz commodity server.
5.	Liu, Ming, et al. (författare) A Reconfigurable Design Framework for FPGA Adaptive Computing 2009 Ingår i: 2009 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS. - : IEEE. - 9781424452934 ; , s. 439-444 Konferensbidrag (refereegranskat)abstract Partial Reconfiguration (PR) offers the possibility to adaptively change part of the FPGA design without stopping the remaining system. In this paper, we present a comprehensive framework for adaptive computing, in which the key design points of hardware processes, system interconnections, Operating Systems (OS), device drivers, scheduler software as well as context switching are addressed in different hardware/software layers. A case study is discussed to demonstrate an example of swapping a Flash memory controller and an SRAM controller in response to diverse memory access needs. Result analysis reveals a more efficient resource utilization of 52.1% I/O pads, 86.5% LUTs and 81.3% Flip-Flops, when compared to the static design with the same functionality. A small reconfiguration overhead of context switching is measured within the range from hundreds of microseconds to milliseconds. Moreover, technical perspectives are analyzed and it is foreseen to obtain great benefits with the proposed design framework in target applications of particle physics experiments.
6.	Liu, Ming, et al. (författare) ATCA-based Computation Platform for Data Acquisition and Triggering in Particle Physics Experiments 2008 Ingår i: 2008 INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE AND LOGIC APPLICATIONS, VOLS 1 AND 2. ; , s. 287-292 Konferensbidrag (refereegranskat)abstract An ATCA-based computation platform for data acquisition and trigger applications in nuclear and particle physics experiments has been developed. Each Compute Node (CN) which appears as a Field Replaceable Unit (FRU) in an ATCA shelf, features 5 Xilinx Virtex-4 FX60 FPGAs and up to 10 GBytes DDR2 memory. Connectivity is provided with 8 optical links and 5 Gigabit Ethernet ports, which are mounted on each board to receive data from detectors and forward results to outer shelves or PC farms with attached mass storage. Fast point-to-point on-board interconnections between FPGAs as well as the full-mesh shelf backplane provide flexibility and high bandwidth to partition algorithms and correlate results among them. The system represents a highly reconfigurable and scalable solution for multiple applications.
7.	Liu, Ming, et al. (författare) Hardware/Software co-design of a general-purpose computation platform in particle physics 2007 Ingår i: ICFPT 2007. - 9781424414710 ; , s. 177-183 Konferensbidrag (refereegranskat)abstract In this paper we present a hardware/software co-design based computation platform for online data processing in particle physics experiments. Our goal is to ease and accelerate the development and make it universal and scalable for multiple applications, on the premise of guaranteeing high communicating and processing capabilities. The entire computation network consists of quite a few interconnected compute nodes, each of which has multiple FPGAs to implement specific algorithms for data processing. High-speed communication features including RocketIO multi-gigabit transceiver and Gigabit Ethernet are supported by FPGAs to construct internal and external connections. An embedded Linux operating system is fitted on the PowerPC CPU core inside the Xilinx Virtex-4 FX FPGA. Thus programmers can access hardware resources via device drivers and write application programs to manage the system from the high level. Furthermore measurements have been executed using the development board to investigate both communicating and processing performances of the system. Results show that the computation platform is able to communicate at a UDP/IP data rate of around 400 Mbps per Ethernet link, and the event selection engine could process an event rate of 25%.
8.	Liu, Ming, et al. (författare) Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration 2009 Ingår i: FPL 09. - 9781424438914 ; , s. 498-502 Konferensbidrag (refereegranskat)abstract Run-time Partial Reconfiguration (PR) speed is significant in applications especially when fast IP core switching is required. In this paper, we propose to use Direct Memory Access (DMA), Master (MST) burst, and a dedicated Block RAM (BRAM) cache respectively to reduce the reconfiguration time. Based on the Xilinx PR technology and the Internal Configuration Access Port (ICAP) primitive in the FPGA fabric, we discuss multiple design architectures and thoroughly investigate their performance with measurements for different partial bitstream sizes. Compared to the reference OPB_HWICAP and XPS_HWICAP designs, experimental results show that DMA_HWICAP and MST_HWICAP reduce the reconfiguration time by one order of magnitude, with little resource consumption overhead. The BRAM_HWICAP design can even approach the reconfiguration speed limit of the ICAP primitive at the cost of large Block RAM utilization.
9.	Liu, Ming, et al. (författare) System-on-an-FPGA Design for Real-time Particle Track Recognition and Reconstruction in Physics Experiments 2008 Ingår i: 11TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN - ARCHITECTURES, METHODS AND TOOLS. - LOS ALAMITOS : IEEE COMPUTER SOC. ; , s. 599-605 Konferensbidrag (refereegranskat)abstract In particle physics experiments, the momenta of charged particles are studied by observing their deflection in a magnetic field. Dedicated detectors measure the particle tracks and complex algorithms are required for track recognition and reconstruction. This CPU-intensive task is usually implemented as off-line software running on PC clusters. In this paper we present a system-on-chip design for the track recognition and reconstruction based on modern FPGA technologies. The basic principle of the algorithm is ported from software into the FPGA fabric. The fundamental architecture of the tracking processor is described in detail. Working as processing engines in compute nodes, the tracking processor helps recognize potential track candidates in real time and improves the selection efficiency of the data acquisition and trigger system. Our design study shows that the tracking module can be integrated in a single Xilinx Virtex-4 FX60 FPGA. The processing capability of the design is about 16.7K sub-events per second per module with our experimental setup, which achieves 20 times speedup compared to the software implementation.
10.	Liu, Ming, et al. (författare) Trigger algorithm development on FPGA-based Compute Nodes 2009 Ingår i: 2009 16th IEEE-NPSS Real Time Conference. - New York : IEEE. - 9781424457960 ; , s. 478-484 Konferensbidrag (refereegranskat)abstract Based on the ATCA computation architecture and Compute Nodes (CN), investigation and implementation work has been carried out for HADES and PANDA trigger algorithms. We present our designs for HADES track reconstruction processing, Cherenkov ring recognition, Time-Of-Flight processing, electromagnetic shower recognition, and the PANDA straw tube tracking algorithm. They will appear as co-processors in the uniform system design to undertake the detector-specific computing. The algorithm principles will be explained and hardware designs are described in the paper. The current progress reveals the feasibility of implementing these algorithms on FPGAs. Experimental results also demonstrate the performance speedup when compared to alternative software solutions, as well as the potential capability of high-speed parallel/pipelined processing in Data Acquisition and Trigger systems.
11.	Lu, Zhonghai, et al. (författare) A Flow Regulator for On-Chip Communication 2009 Ingår i: IEEE INTERNATIONAL SOC CONFERENCE, PROCEEDINGS. - 9781424452200 ; , s. 151-154 Konferensbidrag (refereegranskat)abstract We have proposed (sigma, rho)-based flow regulation as a design instrument for System-on-Chip (SoC) architects to control quality-of-service and achieve cost-effective communication, where sigma bounds the traffic burstiness and rho the traffic rate. In this paper, we present a hardware implementation of the regulator. We discuss its microarchitecture. Based on this microarchitecture, we design, implement and synthesize a multi-flow regulator for AXI. Our experiments show the effectiveness of such a regulation device on the control of delay, jitter and buffer requirements.
12.	Lu, Zhonghai, et al. (författare) A power efficient flit-admission scheme for wormhole-switched networks on chip 2005 Ingår i: WMSCI 2005. - 9789806560567 ; , s. 25-30 Konferensbidrag (refereegranskat)abstract Reducing power consumption is a main challenge when adopting a network as a global on-chip communication interconnect, since the reduction in power dissipation should not come at the expense of degrading the system performance. We investigate power in a wormhole-switched network with focus on the impact of flit-admission schemes, i.e., when and how the flits of packets are admitted into the network. We have proposed a novel flit-admission scheme that significantly reduces the switch complexity while maintaining equivalent network performance. This paper investigates its influence on network power involving both switches and links. We conduct experiments on a 2D mesh network. The results show that our flit-admission scheme achieves significant power and area reduction without performance penalty. To our knowledge, our work is the first study of power dissipation on flit admission schemes.
13.	Lu, Zhonghai, et al. (författare) Admitting and ejecting flits in wormhole-switched networks on chip 2007 Ingår i: IET Computers and Digital Techniques. - : Institution of Engineering and Technology (IET). - 1751-8601. ; 1:5, s. 546-556 Tidskriftsartikel (refereegranskat)abstract Reducing the design complexity of switches is essential for cost reduction and power saving in on-chip networks. In wormhole-switched networks, packets are split into flits which are then admitted into and delivered in the network. When reaching destinations, flits are ejected from the network. Since flit admission, flit delivery and flit ejection interfere with each other directly and indirectly, techniques for admitting and ejecting flits exert a significant impact on network performance and switch cost. Different flit-admission and flit-ejection micro-architectures are investigated. In particular, for flit admission, a novel coupling scheme which binds a flit-admission queue with a physical channel (PC) is presented. This scheme simplifies the switch crossbar from 2p × p to (p + 1) × p, where p is the number of PCs per switch. For flit ejection, a p-sink model that uses only p flit sinks to eject flits is proposed. In contrast to an ideal ejection model which requires p · v flit sinks (v is the number of virtual channels per PC), the buffering cost of flit sinks becomes independent of v. The proposed flit-admission and flit-ejection schemes are evaluated with both uniform and locality traffic in a 2D 4 × 4 mesh network. The results show that both schemes do not degrade network performance in terms of average packet latency and throughput if the flit injection rate is slower than 0.57 flit/cycle/node.
14.	Lu, Zhonghai, et al. (författare) Cluster-based simulated annealing for mapping cores onto 2D mesh networks on chip 2008 Ingår i: 2008 IEEE Workshop On Design And Diagnostics Of Electronic Circuits And Systems, Proceedings. - 9781424422760 ; , s. 92-97 Konferensbidrag (refereegranskat)abstract In Network-on-Chip (NoC) application design, core-to-node mapping is an important but intractable optimization problem. In the paper, we use simulated annealing to tackle the mapping problem in 2D mesh NoCs. In particular, we combine a clustering technique with the simulated annealing to speed up the convergence to near-optimal solutions. The clustering exploits the connectivity and distance relation in the network architecture as well as the locality and bandwidth requirements in the core communication graph. The annealing is cluster-aware and may be dynamically constrained within clusters. Our experiments suggest that simulated annealing can be effectively used to solve the mapping problem with a scalable size, and the combined strategy improves over the simulated annealing in execution time by up to 30% without compromising the quality of solutions.
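The core-to-node mapping problem described in this abstract can be sketched with plain simulated annealing. The clustering step that the paper adds is not reproduced here, and the cost function (bandwidth times Manhattan hop distance), the cooling schedule, and all names are assumptions chosen for illustration rather than details from the paper.

```python
import math
import random


def mapping_cost(mapping, traffic, width):
    """Sum of bandwidth * Manhattan distance over all communicating core pairs.

    mapping[i] is the mesh node index hosting core i; nodes are numbered
    row by row in a mesh that is `width` nodes wide.
    """
    cost = 0
    for (src, dst), bandwidth in traffic.items():
        sx, sy = mapping[src] % width, mapping[src] // width
        dx, dy = mapping[dst] % width, mapping[dst] // width
        cost += bandwidth * (abs(sx - dx) + abs(sy - dy))
    return cost


def anneal(traffic, n_cores, width, steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Plain simulated annealing over core placements (no clustering)."""
    rng = random.Random(seed)
    mapping = list(range(n_cores))          # start with core i on node i
    cost = mapping_cost(mapping, traffic, width)
    best, best_cost = mapping[:], cost
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(range(n_cores), 2)
        mapping[a], mapping[b] = mapping[b], mapping[a]   # propose a swap
        new_cost = mapping_cost(mapping, traffic, width)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = mapping[:], cost
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]   # undo the swap
        temp *= alpha                                          # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    # Hypothetical 4-core communication graph on a 2 x 2 mesh.
    demo_traffic = {(0, 3): 10, (1, 2): 5}
    print(anneal(demo_traffic, n_cores=4, width=2))
```

A cluster-aware variant, as the abstract describes, would restrict or bias the proposed swaps using groups derived from the network and the communication graph; how those clusters are formed is not stated in the abstract.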
15.	Lu, Zhonghai, et al. (författare) Connection-oriented multicasting in wormhole-switched networks on chip 2006 Ingår i: IEEE Computer Society Annual Symposium on VLSI, Proceedings - EMERGING VLSI TECHNOLOGIES AND ARCHITECTURES. ; , s. 205-210 Konferensbidrag (refereegranskat)abstract Network-on-Chip (NoC) proposes networks to replace buses as a scalable global communication interconnect for future SoC designs. However, a bus is very efficient in broadcasting. As the system size scales up to explore the chip capacity, broadcasting in NoCs must be efficiently supported. This paper presents a novel multicast scheme in wormhole-switched NoCs. In this scheme, a multicast procedure consists of establishment, communication and release phases. A multicast group can request to reserve virtual channels during establishment and has priority on arbitration of link bandwidth. This multicasting method has been effectively implemented in a mesh network with deadlock freedom. Our experiments show that the multicast technique improves throughput, and does not exhibit significant impact on unicast performance in a network with mixed unicast and multicast traffic if the network is not saturated.
16.	Lu, Zhonghai, et al. (författare) Evaluation of on-chip networks using deflection routing 2006 Ingår i: Proceedings of the 16th ACM Great Lakes symposium on VLSI. - New York, NY, USA : Association for Computing Machinery (ACM). ; , s. 296-301 Konferensbidrag (refereegranskat)abstract Deflection routing is being proposed for networks on chips since it is simple and adaptive. A deflection switch can be much smaller and faster than a wormhole or virtual cut-through switch. A deflection-routed network has three orthogonal characteristics: topology, routing algorithm and deflection policy. In this paper we evaluate deflection networks with different topologies such as mesh, torus and Manhattan Street Network, different routing algorithms such as random, dimension XY, delta XY and minimum deflection, as well as different deflection policies such as non-priority, weighted priority and straight-through policies. Our results suggest that the performance of a deflection network is more sensitive to its topology than the other two parameters. It is less sensitive to its routing algorithm, but a routing algorithm should be minimal. A priority-based deflection policy that uses global and history-related criteria can achieve both better average-case and worst-case performance than a non-priority or priority policy that uses local and stateless criteria. These findings are important since they can guide designers to make the right decisions on the deflection network architecture, for instance, selecting a routing algorithm or deflection policy which has potentially low cost and high speed for hardware implementation.
17.	Lu, Zhonghai (författare) Design and Analysis of On-Chip Communication for Network-on-Chip Platforms 2007 Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract Due to the interplay between increasing chip capacity and complex applications, System-on-Chip (SoC) development is confronted by severe challenges, such as managing deep submicron effects, scaling communication architectures and bridging the productivity gap. Network-on-Chip (NoC) has been a rapidly developed concept in recent years to tackle the crisis with focus on network-based communication. NoC problems spread in the whole SoC spectrum ranging from specification, design, implementation to validation, from design methodology to tool support. In the thesis, we formulate and address problems in three key NoC areas, namely, on-chip network architectures, NoC network performance analysis, and NoC communication refinement. Quality and cost are major constraints for micro-electronic products, particularly, in high-volume application domains. We have developed a number of techniques to facilitate the design of systems with low area, high and predictable performance. From flit admission and ejection perspective, we investigate the area optimization for a classical wormhole architecture. The proposals are simple but effective. Not only offering unicast services, on-chip networks should also provide effective support for multicast. We suggest a connection-oriented multicasting protocol which can dynamically establish multicast groups with quality-of-service awareness. Based on the concept of a logical network, we develop theorems to guide the construction of contention-free virtual circuits, and employ a back-tracking algorithm to systematically search for feasible solutions. Network performance analysis plays a central role in the design of NoC communication architectures. 
Within a layered NoC simulation framework, we develop and integrate traffic generation methods in order to simulate network performance and evaluate network architectures. Using these methods, traffic patterns may be adjusted with locality parameters and be configured per pair of tasks. We propose also an algorithm-based analysis method to estimate whether a wormhole-switched network can satisfy the timing constraints of real-time messages. This method is built on traffic assumptions and based on a contention tree model that captures direct and indirect network contentions and concurrent link usage. In addition to NoC platform design, application design targeting such a platform is an open issue. Following the trends in SoC design, we use an abstract and formal specification as a starting point in our design flow. Based on the synchronous model of computation, we propose a top-down communication refinement approach. This approach decouples the tight global synchronization into process local synchronization, and utilizes synchronizers to achieve process synchronization consistency during refinement. Meanwhile, protocol refinement can be incorporated to satisfy design constraints such as reliability and throughput. The thesis summarizes the major research results on the three topics.
18.	Lu, Zhonghai, et al. (författare) Feasibility analysis of messages for on-chip networks using wormhole routing 2005 Ingår i: PROCEEDINGS OF THE ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, VOLS 1 AND 2. - New York, New York, USA : IEEE conference proceedings. ; , s. 960-964 Konferensbidrag (refereegranskat)abstract The feasibility of a message in a network concerns if its timing property can be satisfied without jeopardizing any messages already in the network to meet their timing properties. We present a novel feasibility analysis for real-time (RT) and non-realtime (NT) messages in wormhole-routed networks on chip. For RT messages, we formulate a contention tree that captures contentions in the network. For coexisting RT and NT messages, we propose a simple bandwidth partitioning method that allows us to analyze their feasibility independently.
19.	Lu, Zhonghai, et al. (författare) Flow Regulation for On-Chip Communication 2009 Ingår i: DATE. - 9781424437818 ; , s. 578-581 Konferensbidrag (refereegranskat)abstract We propose (sigma, rho)-based flow regulation as a design instrument for System-on-Chip (SoC) architects to control quality-of-service and achieve cost-effective communication, where sigma bounds the traffic burstiness and rho the traffic rate. This regulation changes the burstiness and timing of traffic flows, and can be used to decrease delay and reduce buffer requirements in the SoC infrastructure. In this paper, we define and analyze the regulation spectrum, which bounds the upper and lower limits of regulation. Experiments on a Network-on-Chip (NoC) with guaranteed service demonstrate the benefits of regulation. We conclude that flow regulation may exert significant positive impact on communication performance and buffer requirements.
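A (sigma, rho) constraint means that over any interval of length t a flow may send at most sigma + rho * t units of traffic. One common software way to enforce such a bound is a token bucket; the sketch below uses that technique as an assumption and is not the regulator design from the paper. The class name, the `admit` method, and the choice to report a yes/no decision (rather than delaying the flit, as a shaping regulator would) are all illustrative.

```python
class SigmaRhoRegulator:
    """Token-bucket check for a (sigma, rho) traffic envelope.

    sigma: maximum burst size (bucket depth), in flits.
    rho: sustained rate, in flits per time unit.
    """

    def __init__(self, sigma: float, rho: float):
        self.sigma = sigma
        self.rho = rho
        self.tokens = sigma      # bucket starts full, so an initial burst of sigma passes
        self.last_time = 0.0

    def admit(self, now: float, size: float = 1.0) -> bool:
        """Return True if `size` flits may enter at time `now` without violating the envelope."""
        elapsed = now - self.last_time
        # Refill at rate rho, but never beyond the burst allowance sigma.
        self.tokens = min(self.sigma, self.tokens + self.rho * elapsed)
        self.last_time = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


if __name__ == "__main__":
    regulator = SigmaRhoRegulator(sigma=2, rho=0.5)
    print([regulator.admit(0.0) for _ in range(3)])   # burst of 2 passes, third is held back
    print(regulator.admit(2.0))                        # one token has accumulated after 2 time units
```

A larger sigma tolerates burstier traffic at the cost of more downstream buffering, while rho caps the long-run rate; this is the trade-off the abstract refers to when it links regulation to delay and buffer requirements.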
20.	Lu, Zhonghai, et al. (författare) Layered switching for networks on chip 2007 Ingår i: 2007 44th ACM/IEEE Design Automation Conference, Vols 1 And 2. - 9781595937711 ; , s. 122-127 Konferensbidrag (refereegranskat)abstract We present and evaluate a novel switching mechanism called layered switching. Conceptually, the layered switching implements wormhole on top of virtual cut-through switching. To show the feasibility of layered switching, as well as to confirm its advantages, we conducted an RTL implementation study based on a canonical wormhole architecture. Synthesis results show that our strategy suggests negligible degradation in hardware speed (1%) and area overhead (7%). Simulation results demonstrate that it achieves higher throughput than wormhole alone while significantly reducing the buffer space required at network nodes when compared with virtual cut-through.
21.	Lu, Zhonghai, et al. (författare) Network-on-Chip Benchmarking Specification Part 2 : Micro-Benchmark Specification 2008 Rapport (övrigt vetenskapligt/konstnärligt)
22.	Lu, Zhonghai, et al. (författare) Network-on-Chip Micro-Benchmarks 2008 Ingår i: Embedded Systems Design. ; :September Tidskriftsartikel (refereegranskat)abstract The rapid development of Network-on-Chip (NoC) calls for a systematic approach to evaluate and fairly compare various NoC architectures. In this specification, we define a generic NoC architecture, a comprehensive set of synthetic workloads as micro-benchmarks, workload scenarios and evaluation criteria. These micro-benchmarks enable measuring particular properties of NoC architectures, complementing application benchmarks.
23.	Lu, Zhonghai, et al. (författare) NNSE: Nostrum Network-on-Chip Simulation Environment 2005 Ingår i: Proceedings of Swedish System-on-Chip Conference, Stockholm, Sweden, April 2005. Konferensbidrag (övrigt vetenskapligt/konstnärligt)abstract A main challenge for Network-on-Chip (NoC) design is to select a network architecture that suits a particular application. NNSE makes it possible to analyze the performance impact of NoC configuration parameters. It allows one to (1) configure a network with respect to topology, flow control and routing algorithm etc.; (2) configure various regular and application specific traffic patterns; (3) evaluate the network with the traffic patterns in terms of latency and throughput.
24.	Lu, Zhonghai, et al. (författare) Refinement of A Perfectly Synchronous Communication Model onto Nostrum NoC Best-Effort Communication Service 2005 Ingår i: Proceedings of the Forum on Design Languages. Konferensbidrag (refereegranskat)
25.	Lu, Zhonghai, et al. (författare) Refining synchronous communication onto network-on-chip best-effort services 2006 Ingår i: Applications of Specification and Design Languages for SoCs. - DORDRECHT : Springer. - 1402049978 ; , s. 23-38 Konferensbidrag (refereegranskat)abstract We present a novel approach to refine a system model specified with perfectly synchronous communication onto a network-on-chip (NoC) best-effort communication service. It is a top-down procedure with three steps, namely, channel refinement, process refinement, and communication mapping. In channel refinement, synchronous channels are replaced with stochastic channels abstracting the best-effort service. In process refinement, processes are refined in terms of interfaces and synchronization properties. Particularly, we use synchronizers to maintain local synchronization of processes and thus achieve synchronization consistency, which is a key requirement while mapping a synchronous model onto an asynchronous architecture. Within communication mapping, the refined processes and channels are mapped to an NoC architecture. Adopting the Nostrum NoC platform as target architecture, we use a digital equalizer as a tutorial example to illustrate the feasibility of our concepts.
26.	Lu, Zhonghai, et al. (författare) Slot allocation using logical networks for TDM virtual-circuit configuration for network-on-chip 2007 Ingår i: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD. - 9781424413812 ; , s. 18-25 Konferensbidrag (refereegranskat)abstract Configuring Time-Division-Multiplexing (TDM) Virtual Circuits (VCs) for network-on-chip must guarantee conflict freedom for overlapping VCs besides allocating sufficient time slots to them. These requirements are fulfilled in the slot allocation phase. In the paper, we define the concept of a logical network (LN). Based on this concept, we develop and prove theorems that constitute sufficient and necessary conditions to establish conflict-free VCs. Using these theorems, slot allocation for VCs becomes a procedure of computing LNs and then assigning VCs to different LNs. TDM VC configuration can thus be predictable and correct-by-construction. We have integrated this slot allocation method into our multi-node VC configuration program and applied the program to an industrial application.
27.	Lu, Zhonghai, et al. (författare) TDM virtual-circuit configuration for network-on-chip 2008 Ingår i: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - 1063-8210 .- 1557-9999. ; 16:8, s. 1021-1034 Tidskriftsartikel (refereegranskat)abstract In network-on-chip (NoC), time-division-multiplexing (TDM) virtual circuits (VCs) have been proposed to satisfy the quality-of-service requirements of applications. TDM VC is a connection-oriented communication service by which two or more connections take turns to share buffers and link bandwidth using dedicated time slots. In the paper, we first give a formulation of the multinode VC configuration problem for arbitrary NoC topologies. A multinode VC allows multiple source and destination nodes on it. Then we address the two problems of path selection and slot allocation for TDM VC configuration. For the path selection, we use a backtracking algorithm to explore the path diversity, constructively searching the solution space. In the slot allocation phase, overlapped VCs must be configured such that no conflict occurs and their bandwidth requirements are satisfied. We define the concept of a logical network (LN) as an infinite set of associated (time slot, buffer) pairs with respect to a buffer on a given VC. Based on this concept, we develop and prove theorems that constitute sufficient and necessary conditions to establish conflict-free VCs. They are applicable for networks where all nodes operate with the same clock frequency but allowing different phases. Using these theorems, slot allocation for VCs is a procedure of assigning VCs to different LNs. TDM VC configuration can thus be predictable and correct-by-construction. Our experiments on synthetic and real applications validate the effectiveness and efficiency of our approach.
28.	Lu, Zhonghai, et al. (författare) Towards performance-oriented pattern-based refinement of synchronous models onto NoC communication 2006 Ingår i: DSD 2006: 9th EUROMICRO Conference on Digital System Design: Architectures, Methods and Tools, Proceedings. - 0769526098 ; , s. 37-44 Konferensbidrag (refereegranskat)abstract We present a performance-oriented refinement approach that refines a perfectly synchronous communication model onto Network-on-Chip (NoC) communication. We first identify four basic forms of NoC process interaction patterns at the process level, namely, producer-consumer, peers, client-server and multicast. We propose a three-step top-down refinement method: channel refinement, protocol refinement and channel mapping. We describe the method in detail for the producer-consumer pattern. In channel refinement, we deal with interfacing multiple clock domains and use a stochastic process to model channel delay and jitter. In protocol refinement, we show how to refine communication towards application requirements such as reliability and throughput. In channel mapping, we discuss channel convergence and channel merge arising from channel overlapping. All the refinements have been conducted and validated as an integral design phase towards implementation in ForSyDe, a formal system-level design methodology based on a synchronous model of computation.
29.	Lu, Zhonghai, et al. (author) Traffic configuration for evaluating networks on chips 2005 In: Fifth International Workshop on System-on-Chip for Real-Time Applications, Proceedings. - : IEEE Computer Society. - 0769524036 ; , pp. 535-540 Conference paper (peer-reviewed) Abstract: Network-on-Chip (NoC) provides a network as a global communication platform for future SoC designs. Evaluating network architectures requires both synthetic workloads and application-oriented traffic. We present our traffic configuration methods that can be used to configure uniform and locality traffic as synthetic workloads, and to configure channel-based traffic for specific application(s). We also illustrate the significance of applying these methods to configure traffic for network evaluation and system simulation. These traffic configuration methods have been integrated into our Nostrum NoC simulation environment.
30.	Lu, Zhonghai, et al. (author) Trends of Terascale Computing Chips in the Next Ten Years 2009 In: 2009 IEEE 8TH INTERNATIONAL CONFERENCE ON ASIC, VOLS 1 AND 2, PROCEEDINGS. - NEW YORK : IEEE. ; , pp. 62-66 Conference paper (peer-reviewed) Abstract: Moore's law steadily continues though facing a number of challenges. This paper identifies ongoing and desirable trends to exploit the technology capacity and further Moore's law for terascale on-chip computing architectures in the next ten years. Four foreseeable trends are: from single core to many cores, from bus-based to network-based interconnect, from centralized memory to distributed memory, and from 2D integration to 3D integration. We motivate these trends and show that the number of design choices for computing chips is increasing rapidly, leading to an exploding design space with uncountable opportunities for the innovative architect. Moreover, we envision that the multicore Network-on-Chip will become an infrastructure backbone and accumulate many other infrastructural functions such as memory, power and resource management, testing and diagnostic services.
31.	Lu, Zhonghai, et al. (author) Using synchronizers for refining synchronous communication onto Hardware/Software architectures 2007 In: RSP 2007. - : IEEE Computer Society. - 9780769528342 ; , pp. 143-149 Conference paper (peer-reviewed) Abstract: We have presented a formal set of synchronization components called synchronizers for refining synchronous communication onto HW/SW codesign architectures. Such an architecture imposes asynchronous communication between HW-HW, SW-SW and HW-SW components. The synchronizers enable local synchronization, thus satisfying the synchronization requirement of a typical IP core. In this paper we present their implementations in HW, SW and HW/SW as well as their application. To validate our concepts, we conduct a case study on a Nios FPGA that comprises a processor, memory and custom logic. The final HW/SW implementation achieves equivalent performance to pure HW implementation. Our prototyping experience suggests that the synchronizers can be standardized as library modules and effectively separate the design of computation from that of communication.
32.	Naeem, Abdul, et al. (author) Scalability of Relaxed Consistency Models in NoC based Multicore Architectures 2009 In: SIGARCH Computer Architecture News. - : ACM Press. - 0163-5964 .- 1943-5851. ; 37:5, pp. 8-15 Journal article (other academic/artistic) Abstract: This paper studies the realization of relaxed memory consistency models in network-on-chip based distributed shared memory (DSM) multi-core systems. Within DSM systems, memory consistency is a critical issue since it affects not only the performance but also the correctness of programs. We investigate the scalability of the relaxed consistency models (weak, release consistency) implemented by using transaction counters. Our experimental results compare the average and maximum code, synchronization and data latencies of the two consistency models for various network sizes with regular mesh topologies. The observed latencies rise for both consistency models as the network size grows. However, the scaling behaviors are different. With the release consistency model these latencies grow significantly slower than with the weak consistency model due to better optimization potential by means of overlapping, reordering and program order relaxations. The release consistency improves the performance by 15.6% and 26.5% on average in the code and consistency latencies over the weak consistency model for the specific application, as the system grows from single core to 64 cores. The latency of data transactions grows 2.2 times faster on average with a weak consistency model than with a release consistency model when the system scales from single core to 64 cores.
33.	She, Huimin, et al. (author) A Network-based System Architecture for Remote Medical Applications 2007 In: Proceedings of the Asia-Pacific Advanced Network Meeting. Conference paper (peer-reviewed) Abstract: Nowadays, the evolution of wireless communication and network technologies enables remote medical services to be available everywhere in the world. In this paper, a network-based system architecture adopting the wireless personal area network (WPAN) protocol IEEE 802.15.4/Zigbee standard and 3G communication networks for remote medical applications is proposed. In the proposed system, the number and type of medical sensors are scalable depending on individual needs. This feature allows the system to be flexibly applied in several medical applications. Furthermore, a differentiated service using priority scheduling and data compression is introduced. This scheme can not only reduce transmission delay for critical physiological signals and enhance bandwidth utilization at the same time, but also decrease power consumption of the hand-held personal server, which uses a battery as the energy source.
34.	She, Huimin, et al. (author) Analysis of Traffic Splitting Mechanisms for 2D Mesh Sensor Networks 2008 In: International Journal of Software Engineering and Its Applications. - 1738-9984. ; 2:3 Journal article (peer-reviewed) Abstract: For many applications of sensor networks, it is essential to ensure that messages are transmitted to their destinations within delay bounds and the buffer size of each sensor node is as small as possible. In this paper, we first introduce the system model of a mesh sensor network. Based on this system model, the expressions for deriving the delay bound and buffer requirement bound are presented using network calculus theory. In order to balance traffic load and improve resource utilization, three traffic splitting mechanisms are proposed, and the two bounds are derived for these traffic splitting mechanisms. To show how our method applies to real applications, we conduct a case study on a fresh food tracking application, which monitors the food freshness status in real time during transportation. The numerical results show that the delay bound and buffer requirement bound are reduced when applying traffic splitting mechanisms. Thus the performance of the whole sensor network is improved at lower cost.
35.	She, Huimin, et al. (author) Analytical Evaluation of Retransmission Schemes in Wireless Sensor Networks 2009 In: 2009 IEEE VEHICULAR TECHNOLOGY CONFERENCE. - 9781424425167 ; , pp. 38-42 Conference paper (peer-reviewed) Abstract: Retransmission has been adopted as one of the most popular schemes for improving transmission reliability in wireless sensor networks. Many previous works have addressed reliable transmission issues experimentally; however, there is still a lack of analytical techniques to evaluate these solutions. Based on the traffic model, service model and energy model, we propose an analytical method to analyze the delay and energy metrics of two categories of retransmission schemes: hop-by-hop retransmission (HBH) and end-to-end retransmission (ETE). With the experiment results, the maximum packet transfer delay and energy efficiency of these two schemes are compared in several scenarios. Moreover, the analytical results of transfer delay are validated through simulations. Our experiments demonstrate that HBH has less energy consumption at the cost of larger transfer delay compared with ETE. With the same target success probability, ETE is superior on the delay metric for low bit-error-rate (BER) cases, while HBH is superior for high BER cases.
36.	She, Huimin, et al. (author) Deterministic Worst-case Performance Analysis for Wireless Sensor Networks 2008 In: Proceedings of the International Wireless Communications and Mobile Computing Conference. - 9781424422029 ; , pp. 1081-1086 Conference paper (peer-reviewed) Abstract: Dimensioning wireless sensor networks requires formal methods to guarantee network performance and cost under any conditions. Based on network calculus, this paper presents a deterministic analysis method for evaluating the worst-case performance and buffer cost of sensor networks. To this end, we introduce three general traffic flow operators and derive their delay and buffer bounds. These operators are general because they can be used in combination to model any complex traffic flow scenario in sensor networks. Furthermore, our method integrates variable duty cycle to allow the sensor nodes to operate at lower rates, thus saving power. Moreover, it incorporates traffic splitting mechanisms in order to balance network workload and nodes' buffers. To show how our method applies to real applications, we conduct a case study on a fresh food tracking application, which monitors the food freshness in real time. The experimental results demonstrate that our method can be used either to perform network planning before deployment or to conduct network reconfiguration after deployment.
37.	She, Huimin, et al. (author) Traffic splitting with network calculus for mesh sensor networks 2007 In: Proceedings of Future Generation Communication and Networking, FGCN 2007. - : IEEE Computer Society. - 9780769530482 ; , pp. 371-376 Conference paper (peer-reviewed) Abstract: In many applications of sensor networks, it is essential to ensure that messages are transmitted to their destinations as early as possible and the buffer size of each sensor node is as small as possible. In this paper, we first propose a mesh sensor network system model. Based on this system model, the expressions for deriving the delay bound and buffer requirement bound are presented using network calculus. In order to balance traffic load and improve resource utilization, three traffic splitting mechanisms are proposed. The numerical results show that the delay bound and buffer requirement bound are lowered when applying those traffic splitting mechanisms, and thus the performance of the whole sensor network is improved.
38.	Wang, Qiang, et al. (author) Hardware/Software Co-design of an ATCA-based Computation Platform for Data Acquisition and Triggering 2009 In: 16th IEEE NPSS Real Time Conference. - 9781424457960 ; , pp. 485-489 Conference paper (peer-reviewed) Abstract: An ATCA-based computation platform for data acquisition and trigger (TDAQ) applications has been developed for multiple future projects such as PANDA, HADES, and BESIII. Each Compute Node (CN) appears as one of the fourteen Field Replaceable Units (FRU) in an ATCA shelf, which in total features a high performance of 1890 Gbps inter-FPGA on-board channels, 1456 Gbps inter-board backplane connections, 728 Gbps full-duplex optical links, 70 Gbps Ethernet, 140 GBytes DDR2 SDRAM, and all computing resources of 70 Xilinx Virtex-4 FX60 FPGAs. Corresponding to the system architecture, a hardware/software co-design approach is proposed to ease and accelerate the development for different experiments. In the uniform system design, application-specific computation is to be implemented as customized hardware co-processors, while the embedded PowerPC processor takes charge of flexible slow controls and transmission protocol processing.
39.	Weldezion, Awet Yemane, et al. (author) Scalability of Network-on-Chip Communication Architecture for 3-D Meshes 2009 In: 2009 3RD ACM/IEEE INTERNATIONAL SYMPOSIUM ON NETWORKS-ON-CHIP. - NEW YORK : IEEE. - 9781424441426 ; , pp. 114-123 Conference paper (peer-reviewed) Abstract: Design constraints imposed by global interconnect delays as well as limitations in integration of disparate technologies make 3-D chip stacks an enticing technology solution for massively integrated electronic systems. The scarcity of vertical interconnects however imposes special constraints on the design of the communication architecture. This article examines the performance and scalability of different communication topologies for 3-D Network-on-Chips (NoC) using Through-Silicon-Vias (TSV) for inter-die connectivity. Cycle-accurate RTL-level simulations are conducted for two communication schemes based on a 7-port switch and a centrally arbitrated vertical bus using different traffic patterns. The scalability of the 3-D NoC is examined under both communication architectures and compared to 2-D NoC structures in terms of throughput and latency in order to quantify the variation of network performance with the number of nodes and derive key design guidelines.
40.	Wolf, Pieter van der, et al. (author) Definition of Device Level Interface with QoS : Draft Specification 2007 Report (other academic/artistic) Abstract: The extensions to standard IP communication interfaces proposed in SPRINT WP3 document D3.1 are defined. Flow identification signals are added to the DLI signal level interface so transactions can indicate the services they require. These services are specified as Contracts that define the flow characteristics required for correct operation. These characteristics are the main input to an analysis method to validate that a SoC design achieves its performance targets. DLI-Guard units are defined that enforce Contracts by regulating an IP module's identified flows. Monitoring of flow characteristics, such as latency, is also optionally provided. A configuration API for DLI-Guards is outlined together with example code to illustrate its use. This specification is successfully applied to AMBA AXI, the prime example DLI.
41.	Zhang, Yuang, et al. (author) Towards Hierarchical Cluster based Cache Coherence for Large-Scale Network-on-Chip 2009 In: DTIS. ; , pp. 119-122 Conference paper (peer-reviewed) Abstract: We introduce a novel hierarchical cluster based cache coherence scheme for large-scale NoC based distributed memory architectures. We describe the hierarchical memory organization. We show analytically that the proposed scheme has better performance than traditional counterparts both in memory overhead and communication cost.