SwePub
Search the SwePub database

Result list for search "WFRF:(Sourdis Ioannis 1979)"

  • Results 1-50 of 66
1.
  • Alvarez, Lluc, et al. (author)
  • eProcessor: European, Extendable, Energy-Efficient, Extreme-Scale, Extensible, Processor Ecosystem
  • 2023
  • In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023, pp. 309-314
  • Conference paper (peer-reviewed) abstract
    • The eProcessor project aims at creating a RISC-V full stack ecosystem. The eProcessor architecture combines a high-performance out-of-order core with energy-efficient accelerators for vector processing and artificial intelligence with reduced-precision functional units. The design of this architecture follows a hardware/software co-design approach with relevant application use cases from the high-performance computing, bioinformatics and artificial intelligence domains. Two eProcessor prototypes will be developed based on two fabricated eProcessor ASICs integrated into a computer-on-module.
  •  
2.
  • Brokalakis, A., et al. (author)
  • COSSIM: An open-source integrated solution to address the simulator gap for systems of systems
  • 2018
  • In: Proceedings - 21st Euromicro Conference on Digital System Design, DSD 2018. - ISBN 9781538673768, pp. 115-120
  • Conference paper (peer-reviewed) abstract
    • In an era of complex networked heterogeneous systems, simulating only isolated parts, components or attributes of a system under design is not a viable, accurate or efficient option. The interactions are too many and too complicated to produce meaningful results, and the optimization opportunities are severely limited when considering each part of a system in an isolated manner. The presented COSSIM simulation framework is the first known open-source, high-performance simulator that can holistically handle systems-of-systems, including processors, peripherals and networks; such an approach is very appealing to both CPS/IoT and Highly Parallel Heterogeneous Systems designers and application developers. Our highly integrated approach is further augmented with accurate power estimation and security sub-tools that can tap into all system components and perform security and robustness analysis of the overall networked system. Additionally, a GUI has been developed to provide easy simulation set-up, execution and visualization of results. COSSIM has been evaluated using real-world applications representing cloud (mobile visual search) and CPS systems (building management), demonstrating high accuracy and performance that scales almost linearly with the number of CPUs dedicated to the simulator.
  •  
3.
  • Mavroidis, Iakovos, et al. (author)
  • ECOSCALE: Reconfigurable computing and runtime system for future exascale systems
  • 2016
  • In: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016, Dresden, Germany, 14-18 March 2016. - ISSN 1530-1591. - ISBN 9783981537062, pp. 696-701
  • Conference paper (peer-reviewed) abstract
    • In order to reach exascale performance, current HPC systems need to be improved. Simple hardware scaling is not a feasible solution due to the increasing utility costs and power consumption limitations. Apart from improvements in implementation technology, what is needed is to refine the HPC application development flow as well as the system architecture of future HPC systems. ECOSCALE tackles these challenges by proposing a scalable programming environment and architecture, aiming to substantially reduce energy consumption as well as data traffic and latency. ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core+OpenCL programming environment and runtime system. The ECOSCALE approach is hierarchical and is expected to scale well by partitioning the physical system into multiple independent Workers (i.e. compute nodes). Workers are interconnected in a tree-like fashion and define a contiguous global address space that can be viewed either as a set of partitions in a Partitioned Global Address Space (PGAS), or as a set of nodes hierarchically interconnected via an MPI protocol. To further increase energy efficiency, as well as to provide resilience, the Workers employ reconfigurable accelerators mapped into the virtual address space utilizing a dual stage System Memory Management Unit with coherent memory access. The architecture supports shared partitioned reconfigurable resources accessed by any Worker in a PGAS partition, as well as automated hardware synthesis of these resources from an OpenCL-based programming model.
  •  
4.
  • Pericas, Miquel, 1979, et al. (author)
  • Preface
  • 2022
  • In: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors. - ISSN 1063-6862. ; 2022-July, p. IX
  • Conference paper (other academic/artistic)
  •  
5.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • Single Window Stream Aggregation using Reconfigurable Hardware
  • 2017
  • In: 2017 International Conference on Field Programmable Technology (ICFPT). - ISBN 9781538626566 ; 2018-January, pp. 112-119
  • Conference paper (peer-reviewed) abstract
    • High throughput and low latency stream aggregation - and stream processing in general - is critical for many emerging applications that analyze massive volumes of continuously produced data on-the-fly, to make real-time decisions. In many cases, high-speed stream aggregation can be achieved incrementally by computing partial results for multiple windows. However, for particular problems, storing all incoming raw data to a single window before processing is more efficient or even the only option. This paper presents the first FPGA-based single window stream aggregation design. Using Maxeler's dataflow engines (DFEs), up to 8 million tuples per second can be processed (1.1 Gbps), offering 1-2 orders of magnitude higher throughput than a state-of-the-art stream processing software system. DFEs have a direct feed of incoming data from the network as well as direct access to off-chip DRAM, processing a tuple in less than 4 μs, 4 orders of magnitude lower latency than software. The proposed approach is able to support challenging queries required in realistic stream processing problems (e.g. holistic functions). Our design offers aggregation for up to 1 million concurrently active keys and handles large windows storing up to 6144 values (24 KB) per key.
  •  
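The single-window idea in the abstract above can be sketched in software: all raw values for a key are buffered until the window is complete, so that a holistic function (one that needs every element, such as the median) can be applied. This is an illustrative model only; class and parameter names are assumptions, and the paper's actual design is an FPGA dataflow implementation.

```python
import statistics
from collections import defaultdict

class SingleWindowAggregator:
    """Buffers all raw values per key in a single window, then applies a
    holistic function that needs the complete window contents."""

    def __init__(self, window_size, holistic_fn):
        self.window_size = window_size      # tuples per window, per key
        self.holistic_fn = holistic_fn      # e.g. statistics.median
        self.windows = defaultdict(list)    # key -> buffered raw values

    def insert(self, key, value):
        """Store the raw value; emit a result only when the window is full."""
        win = self.windows[key]
        win.append(value)
        if len(win) == self.window_size:
            result = self.holistic_fn(win)
            self.windows[key] = []          # start a fresh window
            return result
        return None

agg = SingleWindowAggregator(window_size=4, holistic_fn=statistics.median)
results = [agg.insert("sensor-1", v) for v in [7, 1, 5, 3]]
# → [None, None, None, 4.0]: only the fourth insert completes the window
```

Incremental aggregation cannot compute such a median from partial results, which is why buffering the whole window (here per key, as in the paper's up-to-1-million-key design) is sometimes the only option.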
6.
  • Tzilis, Stavros, 1982, et al. (author)
  • SWAS: Stealing Work Using Approximate System-Load Information
  • 2017
  • In: 46th International Conference on Parallel Processing Workshops, ICPPW 2017, Bristol, United Kingdom, 14 August 2017. - ISSN 1530-2016, pp. 309-318
  • Conference paper (peer-reviewed) abstract
    • This paper explores the potential of utilizing approximate system load information to enhance work stealing for dynamic load balancing in hierarchical multicore systems. Maintaining information about the load of a system has not been extensively researched since it is assumed to introduce performance overheads. We propose SWAS, a lightweight approximate scheme for retrieving and using such information, based on compact bit vector structures and lightweight update operations. This approximate information is used to enhance the effectiveness of work stealing decisions. Evaluating SWAS for a number of representative scenarios on a multi-socket multi-core platform showed that work stealing guided by approximate system load information achieves considerable performance improvements: up to 18.5% for dynamic, severely imbalanced workloads; and up to 34.4% for workloads with complex task dependencies, when compared with random work stealing.
  •  
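The compact bit-vector scheme in the SWAS abstract above can be illustrated as follows: each worker cheaply publishes one bit of (possibly stale) load information, and a thief consults the vector before falling back to plain random victim selection. All names here are assumptions for illustration; the paper targets hierarchical multicore systems rather than this flat model.

```python
import random

class ApproxLoadMap:
    """Compact bit vector of approximate per-worker load: bit i set means
    worker i recently reported surplus tasks. Updates are single bit
    operations, so maintaining the map stays lightweight; reads may be
    stale, which is what makes the information approximate."""

    def __init__(self, n_workers):
        self.n = n_workers
        self.bits = 0

    def report(self, worker, has_surplus):
        """Lightweight update: set or clear the worker's load bit."""
        if has_surplus:
            self.bits |= 1 << worker
        else:
            self.bits &= ~(1 << worker)

    def pick_victim(self, thief):
        """Steal from a worker whose load bit is set, if any;
        otherwise fall back to random work stealing."""
        loaded = [w for w in range(self.n)
                  if w != thief and (self.bits >> w) & 1]
        if loaded:
            return random.choice(loaded)
        return random.choice([w for w in range(self.n) if w != thief])

lm = ApproxLoadMap(n_workers=4)
lm.report(2, True)
victim = lm.pick_victim(0)   # → 2, the only worker advertising surplus
```

Guiding the steal this way avoids probing empty queues, which is where the paper's reported gains over random stealing come from.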
7.
  • Athanasopoulos, E., et al. (author)
  • Increasing the Trustworthiness of Embedded Applications
  • 2015
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Cham : Springer International Publishing. - ISSN 1611-3349, 0302-9743. ; 9229, pp. 321-322
  • Conference paper (peer-reviewed) abstract
    • Embedded systems, by their nature, often run unattended with opportunistic rather than scheduled software upgrades and, perhaps most significantly, have long operational lifetimes; hence, they provide excellent targets for massive and remote exploitation. Thus, such systems mandate higher assurances of trust and cyber-security compared to those presently available in state-of-the-art ICT systems. In this poster we present some techniques we utilize in the SHARCS project to ensure a higher level of security for embedded systems.
  •  
8.
  • Athanasopoulos, E., et al. (author)
  • Secure hardware-software architectures for robust computing systems
  • 2015
  • In: Communications in Computer and Information Science. - Cham : Springer International Publishing. - ISSN 1865-0937, 1865-0929. - ISBN 9783319271637 ; 570, pp. 209-212
  • Conference paper (peer-reviewed) abstract
    • The Horizon 2020 SHARCS project is a framework for designing, building and demonstrating secure-by-design applications and services, that achieve end-to-end security for their users. In this paper we present the basic elements of SHARCS that will provide a powerful foundation for designing and developing trustworthy, secure-by-design applications and services for the Future Internet.
  •  
9.
  • Ejaz, Ahsen, 1986, et al. (author)
  • DDRNoC: Dual Data-Rate Network-on-Chip
  • 2017
  • Report (other academic/artistic) abstract
    • This paper introduces DDRNoC, an on-chip interconnection network able to route packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control as opposed to the datapath (switch and link traversal) which exhibits significant slack. DDRNoC capitalizes on this observation allowing two flits per cycle to share the same datapath. Thereby, DDRNoC achieves higher throughput than a Single Data Rate (SDR) network. Alternatively, using lower voltage circuits, the above slack can be exploited to reduce power consumption while matching the SDR network throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than a SDR network that routes packets at the same rate. Post place and route results in 28 nm technology show that, compared to an iso-voltage (1.1V) SDR network, DDRNoC improves throughput proportionally to the SDR datapath slack. Moreover, a low-voltage (0.95V) DDRNoC implementation converts that slack to power reduction offering the 1.1V SDR throughput at a substantially lower energy cost.
  •  
10.
  • Ejaz, Ahsen, 1986, et al. (author)
  • DDRNoC: Dual Data-Rate Network-on-Chip
  • 2018
  • In: Transactions on Architecture and Code Optimization. - Association for Computing Machinery (ACM). - ISSN 1544-3973, 1544-3566. ; 15:2
  • Journal article (peer-reviewed) abstract
    • This article introduces DDRNoC, an on-chip interconnection network capable of routing packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control as opposed to the datapath (switch and link traversal), which exhibits significant slack. DDRNoC capitalizes on this observation, allowing two flits per cycle to share the same datapath. Thereby, DDRNoC achieves higher throughput than a Single Data Rate (SDR) network. Alternatively, using lower voltage circuits, the above slack can be exploited to reduce power consumption while matching the SDR network throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than a SDR network that routes packets at the same rate. Post place and route results in 28nm technology show that, compared to an iso-voltage (1.1V) SDR network, DDRNoC improves throughput proportionally to the SDR datapath slack. Moreover, a low-voltage (0.95V) DDRNoC implementation converts that slack to power reduction offering the 1.1V SDR throughput at a substantially lower energy cost.
  •  
11.
  • Ejaz, Ahsen, 1986, et al. (author)
  • FastTrackNoC: A DDR NoC with FastTrack Router Datapaths
  • 2021
  • Report (other academic/artistic) abstract
    • This paper introduces FastTrackNoC, a Network-on-Chip router architecture that reduces latency by bypassing its switch traversal (ST) stage. FastTrackNoC adds a fast-track path between the head of a particular virtual channel (VC) buffer at each input port and the link of the opposite output. This allows non-turning flits to bypass ST when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely, pipeline bypassing of control stages, precomputed routing and lookahead control signaling, to allow at best a flit to proceed directly to link traversal (LT). FastTrackNoC is applied to a Dual Data Rate (DDR) router in order to maximize throughput. Post place and route results in 28nm technology show that (i) compared to the current state of the art DDR NoCs, FastTrackNoC offers the same throughput and reduces average packet latency by 11-32% requiring up to 5% more power and (ii) compared to current state of the art Single Data Rate (SDR) NoCs, FastTrackNoC reduces packet latency by 9-40% and achieves 16-19% higher throughput with 5% higher power at the SDR NoC saturation point.
  •  
12.
  • Ejaz, Ahsen, 1986, et al. (author)
  • FastTrackNoC: A NoC with FastTrack Router Datapaths
  • 2022
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. - ISSN 1530-0897. ; 2022-April, pp. 971-985
  • Conference paper (peer-reviewed) abstract
    • This paper introduces FastTrackNoC, a Network-on-Chip (NoC) router architecture that reduces packet latency by bypassing its switch traversal (ST) stage. It is based on the observation that there is a bias in the direction a flit takes through a router; e.g., in a 2D mesh network, non-turning hops are preferred, especially when dimension order routing is used. FastTrackNoC capitalizes on this observation and adds to a 2D mesh router a fast-track path between the head of a single input virtual channel (VC) buffer and its most popular, opposite output. This allows non-turning flits to bypass the ST logic, i.e., buffer, input and output multiplexing, when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely allocation bypassing, precomputed routing, and lookahead control signaling, to allow, at best, incoming flits to proceed directly to link traversal (LT). Moreover, it is applied to a Dual Data Rate (DDR) router in order to maximize network throughput. Post place and route results in 28nm show the following: compared to previous DDR NoCs, FastTrackNoC offers 13-32% lower average packet latency; compared to previous multi-VC Single Data Rate (SDR) NoCs, FastTrackNoC reduces latency by 10-40% and achieves 18-21% higher throughput; and compared to a single-channel SDR NoC it offers up to 50% higher throughput and similar latency.
  •  
13.
  • Ejaz, Ahsen, 1986, et al. (author)
  • FreewayNoC: A DDR NoC with Pipeline Bypassing
  • 2018
  • In: 2018 12th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2018. - ISBN 9781538648933
  • Conference paper (peer-reviewed) abstract
    • This paper introduces FreewayNoC, a Network-on-Chip that routes packets at Dual Data Rate (DDR) and allows pipeline bypassing. Based on the observation that a router's datapath is faster than its control, a recent NoC design allowed flits to be routed at DDR, improving throughput to rates defined solely by the switch and link traversal rather than by the control. However, such a DDR NoC suffers from high packet latency, as flits require multiple cycles per hop. A common way to reduce latency at low traffic load is pipeline bypassing, whereby flits that find a contention-free way to the output port can directly traverse the switch. Existing Single Data Rate (SDR) NoC routers support it, but applying pipeline bypassing to a DDR router is more challenging: it would need additional bypassing logic, which would add to the cycle time, compromising the DDR NoC throughput advantage. The FreewayNoC design restricts pipeline bypassing in a DDR router to only flits that go straight, simplifying its logic. Thereby, it offers low packet latency without affecting DDR router cycle time and throughput. Then, at low traffic loads, besides the few turns that a flit would take on its way from source to destination, all other hops could potentially offer minimum latency equal to the delay of the switch and link traversal. Post place and route results in 28 nm technology confirm the above and also show that zero-load latency scales with the hop count better than in current state-of-the-art NoCs.
  •  
14.
  • Ejaz, Ahsen, 1986, et al. (author)
  • HighwayNoC: Approaching Ideal NoC Performance With Dual Data Rate Routers
  • 2021
  • In: IEEE/ACM Transactions on Networking. - ISSN 1558-2566, 1063-6692. ; 29:1, pp. 318-331
  • Journal article (peer-reviewed) abstract
    • This paper describes HighwayNoC, a Network-on-Chip (NoC) that approaches ideal network performance using a Dual Data Rate (DDR) datapath. Based on the observation that a router's datapath is faster than its control, a DDR NoC allows flits to be routed at DDR, improving throughput to rates defined solely by the datapath rather than by the control. DDR NoCs can use pipeline bypassing to reduce packet latency at low traffic load. However, existing DDR routers offer bypassing only on in-network, non-turning hops to simplify the required logic. HighwayNoC extends the bypassing support of DDR routers to local ports, allowing flits to enter and exit the network faster. Moreover, it simplifies the DDR switch allocation and the interface of router ports, reducing power and area costs. Post place and route results in 28 nm technology show that HighwayNoC performs better than current state-of-the-art NoCs. Compared to previous DDR NoCs, HighwayNoC reduces average packet latency by 7.3-27% and power consumption by 1-10%, without affecting throughput. Compared to existing Single Data Rate NoCs, HighwayNoC achieves 17-22% higher throughput, has similar or up to 13.8% lower packet latency, and mixed energy efficiency results.
  •  
15.
  • Eldstål-Ahrens, Albin, 1988, et al. (author)
  • FlatPack: Flexible Compaction of Compressed Memory
  • 2022
  • In: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. - New York, NY, USA : ACM. - ISSN 1089-795X, pp. 96-108
  • Conference paper (peer-reviewed) abstract
    • The capacity and bandwidth of main memory is an increasingly important factor in computer system performance. Memory compression and compaction have been combined to increase effective capacity and reduce costly page faults. However, existing systems typically maintain compaction at the expense of bandwidth. One major cause of extra traffic in such systems is page overflows, which occur when data compressibility degrades and compressed pages must be reorganized. This paper introduces FlatPack, a novel approach to memory compaction which is able to mitigate this overhead by reorganizing compressed data dynamically with less data movement. Reorganization is carried out by an addition to the memory controller, without intervention from software. FlatPack is able to maintain memory capacity competitive with current state-of-the-art memory compression designs, while reducing mean memory traffic by up to 67%. This yields average improvements in performance and total system energy consumption over existing memory compression solutions of 31-46% and 11-25%, respectively. In total, FlatPack improves on baseline performance and energy consumption by 108% and 40%, respectively, in a single-core system, and 83% and 23%, respectively, in a multi-core system.
  •  
16.
  • Eldstål-Ahrens, Albin, 1988, et al. (author)
  • L2C: Combining Lossy and Lossless Compression on Memory and I/O
  • 2022
  • In: Transactions on Embedded Computing Systems. - Association for Computing Machinery (ACM). - ISSN 1558-3465, 1539-9087. ; 21:1
  • Journal article (peer-reviewed) abstract
    • In this paper we introduce L2C, a hybrid lossy/lossless compression scheme applicable both to the memory subsystem and the I/O traffic of a processor chip. L2C employs general-purpose lossless compression and combines it with state-of-the-art lossy compression to achieve compression ratios up to 16:1 and improve the utilization of the chip's bandwidth resources. Compressing memory traffic yields lower memory access time, improving system performance and energy efficiency. Compressing I/O traffic offers several benefits for resource-constrained systems, including more efficient storage and networking. We evaluate L2C as a memory compressor in simulation with a set of approximation-tolerant applications. L2C improves baseline execution time by an average of 50%, and total system energy consumption by 16%. Compared to the current state-of-the-art lossy and lossless memory compression approaches, L2C improves execution time by 9% and 26%, respectively, and reduces system energy costs by 3% and 5%, respectively. I/O compression efficacy is evaluated using a set of real-life datasets. L2C achieves compression ratios of up to 10.4:1 for a single dataset and on average about 4:1, while introducing no more than 0.4% error.
  •  
17.
  • Eldstål-Ahrens, Albin, 1988, et al. (author)
  • MemSZ: Squeezing Memory Traffic with Lossy Compression
  • 2020
  • In: Transactions on Architecture and Code Optimization. - Association for Computing Machinery (ACM). - ISSN 1544-3973, 1544-3566. ; 17:4
  • Journal article (peer-reviewed) abstract
    • This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25% introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
  •  
18.
  • Eldstål Damlin, Albin, 1988, et al. (author)
  • AVR: Reducing Memory Traffic with Approximate Value Reconstruction
  • 2019
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. ; 5 August 2019
  • Conference paper (peer-reviewed) abstract
    • This paper describes Approximate Value Reconstruction (AVR), an architecture for approximate memory compression. AVR reduces the memory traffic of applications that tolerate approximations in their dataset. Thereby, it utilizes off-chip bandwidth more efficiently, significantly improving system performance and energy efficiency. AVR compresses memory blocks using low latency downsampling that exploits similarities between neighboring values and achieves aggressive compression ratios, up to 16:1 in our implementation. The proposed AVR architecture supports our compression scheme, maximizing its effect and minimizing its overheads by (i) co-locating compressed and uncompressed data in the Last Level Cache (LLC), (ii) efficiently handling LLC evictions, (iii) keeping track of badly compressed memory blocks, and (iv) avoiding LLC pollution with unwanted decompressed data. For applications that tolerate aggressive approximation in large fractions of their data, AVR reduces memory traffic by up to 70%, execution time by up to 55%, and energy costs by up to 20%, introducing less than 1% error to the application output.
  •  
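The downsampling idea in the AVR abstract above can be illustrated in a few lines: a memory block of similar neighboring values is compressed by keeping every k-th value, and approximately reconstructed by repeating each kept sample. This is a simplified stand-in for the paper's hardware scheme; the function names and the nearest-neighbor reconstruction are assumptions for illustration.

```python
def compress_block(values, ratio):
    """Downsample a block: keep every `ratio`-th value (e.g. ratio=4 gives 4:1).
    Effective when neighboring values are similar, as AVR exploits."""
    return values[::ratio]

def decompress_block(samples, ratio, length):
    """Approximate reconstruction: repeat each kept sample to refill the block."""
    out = []
    for s in samples:
        out.extend([s] * ratio)
    return out[:length]

block = [1.0, 1.1, 1.2, 1.3, 5.0, 5.1, 5.2, 5.3]
packed = compress_block(block, ratio=4)           # → [1.0, 5.0]
approx = decompress_block(packed, 4, len(block))  # each sample repeated 4 times
```

The reconstruction error stays small only while neighbors really are similar, which is why a design like AVR must also track badly compressed blocks and fall back to storing them uncompressed.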
19.
  • Ma, Yang, et al. (author)
  • Towards real-time whisker tracking in rodents for studying sensorimotor disorders
  • 2017
  • In: Proceedings - 2017 17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017. ; 2018-January, pp. 137-145
  • Conference paper (peer-reviewed) abstract
    • The rodent whisker system is a prominent experimental subject for the study of sensorimotor integration and active sensing. As a result of improved video-recording technology and progressively better neurophysiological methods, there is now the prospect of precisely analyzing the intact vibrissal sensorimotor system. The vibrissae and snout analyzer (ViSA), a widely used algorithm based on computer vision and image processing, has been proven successful for tracking and quantifying rodent sensorimotor behavior, but at a great cost in processing time. In order to accelerate this offline algorithm and eventually employ it for online whisker tracking (less than 1 ms/frame latency), we have explored various optimizations and acceleration platforms, including OpenMP multithreading, NVIDIA GPUs and Maxeler Dataflow Engines. Our experimental results indicate that the optimal solution for an offline implementation of ViSA is currently the OpenMP-based CPU execution. By using 16 CPU threads, we achieve more than 4,500x speedup compared to the original Matlab serial version, resulting in an average processing latency of 1.2 ms/frame, which is a solid step towards real-time (and online) tracking. Analysis shows that running the algorithm on a 32-thread-enabled machine can reduce this number to 0.72 ms/frame, thereby enabling real-time performance. This will allow direct interaction with the whisker system during behavioral experiments. In conclusion, our approach shows that a combination of software optimizations and the careful selection of hardware platform yields the best performance increase.
  •  
20.
  • Malek, Alirad, 1983, et al. (author)
  • A Probabilistic Analysis of Resilient Reconfigurable Designs
  • 2014
  • In: 27th IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2014, Amsterdam, Netherlands, 1-3 October 2014. - ISSN 1550-5774. - ISBN 9781479961559, pp. 141-146
  • Conference paper (peer-reviewed) abstract
    • Reconfigurable hardware can be employed to tolerate permanent faults. Hardware components comprising a System-on-Chip can be partitioned into a handful of substitutable units interconnected with reconfigurable wires to allow isolation and replacement of faulty parts. This paper offers a probabilistic analysis of reconfigurable designs, estimating for different fault densities the average number of fault-free components that can be constructed, as well as the probability of guaranteeing a particular availability of components. Considering the area overheads of reconfigurability, we evaluate the resilience of various reconfigurable designs with different granularities. Based on this analysis, we conduct a comprehensive design-space exploration to identify the granularity mixes that maximize the fault tolerance of a system. Our findings reveal that mixing fine-grain logic with a coarse-grain sparing approach tolerates up to 3x more permanent faults than component redundancy and 2x more than any other purely coarse-grain solution. Component redundancy is preferable at low fault densities, while coarse-grain and mixed-grain reconfigurability maximize availability at medium and high fault densities, respectively.
  •  
21.
  • Malek, Alirad, 1983, et al. (author)
  • Odd-ECC: On-demand DRAM error correcting codes
  • 2017
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. - ISBN 9781450353359 ; Part F131197, pp. 96-101
  • Conference paper (peer-reviewed) abstract
    • An application may have different sensitivity to faults in different subsets of the data it uses. Some data regions may therefore be more critical than others. Capitalizing on this observation, Odd-ECC provides a mechanism to dynamically select the memory fault tolerance of each allocated page of a program on demand, depending on the criticality of the respective data. Odd-ECC error correcting codes (ECCs) are stored in separate physical pages and hidden by the OS as pages unavailable to the user. Still, these ECCs are physically aligned with the data they protect, so the memory controller can efficiently access them. Thereby, the capacity, performance and energy overheads of memory fault tolerance are proportional to the criticality of the data stored. Odd-ECC is applied to memory systems that use conventional 2D DRAM DIMMs as well as to 3D-stacked DRAMs, and evaluated using various applications. Compared to flat memory protection schemes, Odd-ECC substantially reduces the capacity overheads of ECCs while achieving the same Mean Time to Failure (MTTF), and in addition it slightly improves performance and energy costs. Under the same capacity constraints, Odd-ECC achieves substantially higher MTTF compared to a flat memory protection. This comes at a performance and energy cost, which is however still a fraction of the cost introduced by a flat, equally strong scheme.
  •  
22.
  • Malek, Alirad, 1983, et al. (author)
  • Reducing the performance overhead of resilient CMPs with substitutable resources
  • 2015
  • In: Proceedings of the 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFTS 2015. - ISBN 9781509003129, pp. 191-196
  • Conference paper (peer-reviewed) abstract
    • Permanent faults on a chip are often tolerated using spare resources. In the past, sparing has been applied to Chip Multiprocessors (CMPs) at various granularities of substitutable units (SUs). Entire processors, pipeline stages or even individual functional units are isolated when faulty and replaced by spare ones using flexible, reconfigurable interconnects. Although spare resources increase a system's fault tolerance, the extra delay imposed by the reconfigurable interconnects limits performance. In this paper, we study two options for dealing with this delay: (i) pipelining the reconfigurable interconnects and (ii) scaling down the operating frequency. The former keeps a frequency close to that of the baseline processor, but increases the number of cycles required for executing a program. The latter keeps the number of execution cycles constant, but requires a slower clock. We investigate the above performance tradeoff using an adaptive 4-core CMP design with substitutable pipeline stages. We retrieve post place and route results of different designs running two sets of benchmarks and evaluate their performance. Our experiments indicate that adding reconfigurable interconnects for wiring the SUs of a 4-core CMP poses a significant delay, increasing the critical path of the design almost by 3.5 times. On the other hand, pipelining the reconfigurable interconnects increases cycle time by 41% and - depending on the processor configuration - reduces the performance overhead to 1.4-2.9× the execution time of the baseline.
  •  
23.
  • Malek, Alirad, 1983, et al. (author)
  • RQNoC: A resilient quality-of-service network-on-chip with service redirection
  • 2016
  • In: Transactions on Embedded Computing Systems. - : Association for Computing Machinery (ACM). - 1558-3465 .- 1539-9087. ; 15:2, Art. no. 2846097
  • Journal article (peer-reviewed) abstract
    • In this article, we describe RQNoC, a service-oriented Network-on-Chip (NoC) resilient to permanent faults. We characterize the network resources based on the particular service that they support and, when faulty, bypass them, allowing the respective traffic class to be redirected. We propose two alternatives for service redirection, each having different advantages and disadvantages. The first one, Service Detour, uses longer alternative paths through resources of the same service to bypass faulty network parts, keeping traffic classes isolated. The second, Service Merge, uses resources of other services, providing shorter paths but allowing traffic classes to interfere with each other. The remaining network resources that are common to all services employ additional mechanisms for tolerating faults: links tolerate faults using additional spare wires combined with a flit-shifting mechanism, and the router control is protected with Triple Modular Redundancy (TMR). The proposed RQNoC network designs are implemented in 65nm technology and evaluated in terms of performance, area, power consumption and fault tolerance. Service Detour requires 9% more area and consumes 7.3% more power compared to a baseline network that is not tolerant to faults. Its packet latency and throughput are close to the fault-free performance at low fault densities, but fault tolerance and performance drop substantially for 8 or more network faults. Service Merge requires 22% more area and 27% more power than the baseline and has a 9% slower clock. Compared to a fault-free network, a Service Merge RQNoC with up to 32 faults has packet latency increased by 1.5 to 2.4× and throughput reduced to 70% or 50%. However, it delivers substantially better fault tolerance, having a mean network connectivity above 90% even with 32 network faults, versus 41% for a Service Detour network. Combining Service Merge and Service Detour improves fault tolerance further, sustaining a higher number of network faults with reduced packet latency.
  •  
24.
  • Martorell, Xavier, et al. (author)
  • Introduction to the Special Section on FPL 2019
  • 2021
  • In: ACM Transactions on Reconfigurable Technology and Systems. - : Association for Computing Machinery (ACM). - 1936-7414 .- 1936-7406. ; 14:2
  • Journal article (other academic/artistic)
  •  
25.
  •  
26.
  •  
27.
  • Pnevmatikatos, Dionisios N., et al. (author)
  • The DeSyRe runtime support for fault-tolerant embedded MPSoCs
  • 2014
  • In: Proceedings - 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2014. - 9781479942930 ; , pp. 197-204
  • Conference paper (peer-reviewed) abstract
    • Semiconductor technology scaling makes chips more sensitive to faults. This paper describes the DeSyRe design approach and its runtime management for future reliable embedded Multiprocessor Systems-on-Chip (MPSoCs). A lightweight runtime system is described for shared-memory MPSoCs to support fault-tolerant execution upon detection of transient and permanent faults. The DeSyRe runtime system offers re-execution of tasks that suffer from transient faults and task migration in cases where a worker processor is permanently faulty. In addition, a faulty worker can potentially remain usable, increasing the system's fault tolerance. This is achieved using alternative task implementations, which avoid the faulty circuit and are indicated in the application code via pragma annotations, as well as by repairing a faulty core via hardware reconfiguration. Thereby, the system can be dynamically adapted using one or multiple of the above mechanisms to mitigate faults. The DeSyRe runtime system is evaluated using micro-benchmarks running on a Virtex-6 FPGA MPSoC. Results suggest that our enhanced fault-tolerant runtime system can successfully and efficiently execute all application tasks under a variety of fault cases.
  •  
28.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • A Specialized Memory Hierarchy for Stream Aggregation
  • 2021
  • In: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL 2021). - 1946-1488. - 9781665437592 ; , pp. 204-210
  • Conference paper (peer-reviewed) abstract
    • High throughput stream aggregation is essential for many applications that analyze massive volumes of data. Incoming data need to be stored in a sliding window before processing, in case the aggregation functions cannot be computed incrementally. However, this puts tremendous pressure on the memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory but existing approaches employ either only on-chip memory (i.e. SRAM) or only off-chip memory (i.e. DRAM) to store the aggregated values. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. This paper introduces a specialized memory hierarchy for stream aggregation. It employs multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported, i.e. large number of concurrently active keys and large sliding windows, at line-rate performance. A 3-level implementation of the proposed memory hierarchy is used in a reconfigurable stream aggregation dataflow engine (DFE), outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher throughput.
  •  
29.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • Stream Aggregation with Compressed Sliding Windows
  • 2023
  • In: ACM Transactions on Reconfigurable Technology and Systems. - 1936-7414 .- 1936-7406. ; 16:3
  • Journal article (peer-reviewed) abstract
    • High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding window during processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This article addresses this problem by enhancing StreamZip, a dataflow stream aggregation engine that is able to compress the sliding windows. StreamZip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In addition, StreamZip incorporates a caching mechanism for dealing with skewed-key distributions in the incoming data stream. In doing so, StreamZip offers higher throughput as well as larger effective window capacity to support larger problems. StreamZip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating-point numbers. Compared to designs without compression, StreamZip lossless and lossy designs achieve up to 7.5× and 22× higher throughput, while improving the effective memory capacity by up to 5× and 23×, respectively.
  •  
30.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • Streamzip: Compressed Sliding-Windows for Stream Aggregation
  • 2021
  • In: 2021 International Conference on Field-Programmable Technology, ICFPT 2021. - 9781665420105 ; , pp. 203-211 (https://ieeexplore.ieee.org/document/9609952)
  • Conference paper (peer-reviewed) abstract
    • High performance stream aggregation is critical for many emerging applications that analyze massive volumes of data. Incoming data needs to be stored in a sliding-window before processing, in case the aggregation functions cannot be computed incrementally. Updating the window with new incoming values and reading it to feed the aggregation functions are the two primary steps in stream aggregation. Although window updates can be supported efficiently using multi-level queues, frequent window aggregations remain a performance bottleneck as they put tremendous pressure on the memory bandwidth and capacity. This paper addresses this problem by introducing Streamzip, a dataflow stream aggregation engine that is able to compress the sliding-windows. Streamzip deals with a number of data and control dependency challenges to integrate a compressor in the stream aggregation pipeline and alleviate the memory pressure posed by frequent aggregations. In doing so, Streamzip offers higher throughput as well as larger effective window capacity to support larger problems. Streamzip supports diverse compression algorithms offering both lossless and lossy compression to integers as well as floating point numbers. Compared to designs without compression, Streamzip lossless and lossy designs achieve up to 7.5x and 22x higher throughput, while improving the effective memory capacity by up to 5x and 23x, respectively.
  •  
31.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (author)
  • Time-SWAD: A dataflow engine for time-based single window stream aggregation
  • 2019
  • In: Proceedings - 2019 International Conference on Field-Programmable Technology, ICFPT 2019. ; 2019-December, pp. 72-80
  • Conference paper (peer-reviewed) abstract
    • High throughput and low latency streaming aggregation is essential for many applications that analyze massive volumes of data in real time. Incoming data need to be stored in a single sliding window before processing, in cases where incremental aggregations are wasteful or not possible at all; this puts tremendous pressure on the memory bandwidth. In addition, particular problems call for time-based windows, defined by a time interval, where the amount of data per window may vary and which are consequently more challenging to handle. This paper describes Time-SWAD, the first accelerator for time-based single-window stream aggregation. Time-SWAD is a dataflow engine (DFE), implemented on a Maxeler machine, offering high processing throughput, up to 150 Mtuples/sec, similar to related GPU systems, which however do not support both time-based and single windows. It uses a direct feed of incoming data from the network and has direct access to off-chip DRAM, enabling ultra-low processing latency of 1-10 μsec, at least 4 orders of magnitude lower than software-based solutions.
  •  
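As a point of reference for the time-based single-window aggregation that Time-SWAD accelerates, a minimal software version might look as follows. This is an illustrative sketch only (the engine itself is a Maxeler dataflow design, and function names here are assumptions):

```python
from collections import deque

# Toy time-based single sliding window: tuples older than window_sec
# are evicted, and the aggregate is recomputed over the stored window
# after each arrival (the non-incremental case the paper targets).

def aggregate_stream(tuples, window_sec, agg=max):
    """tuples: iterable of (timestamp, value) in arrival order.
    Returns the aggregate over the last window_sec seconds,
    emitted once per incoming tuple."""
    window = deque()
    out = []
    for ts, val in tuples:
        window.append((ts, val))
        # Evict everything that fell out of the time interval.
        while window and window[0][0] <= ts - window_sec:
            window.popleft()
        out.append(agg(v for _, v in window))
    return out
```

Note how the amount of data per window varies with arrival density, which is exactly why time-based windows are harder to provision for than count-based ones.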
32.
  • Ribes, Stefano, 1992, et al. (author)
  • Mapping Multiple LSTM models on FPGAs
  • 2020
  • In: Proceedings - 2020 International Conference on Field-Programmable Technology, ICFPT 2020. ; , pp. 1-9
  • Conference paper (peer-reviewed) abstract
    • Recurrent Neural Networks (RNNs) and their more recent variant, Long Short-Term Memory (LSTM), are utilised in a number of modern applications like natural language processing and human action recognition, where capturing long-term dependencies on sequential and temporal data is required. However, their computational structure imposes a challenge when it comes to their efficient mapping on a computing device due to its memory-bound nature. As recent approaches aim to capture longer dependencies through the utilisation of Hierarchical and Stacked RNN/LSTM models, i.e. models that utilise multiple LSTM models for prediction, meeting the desired application latency becomes even more challenging. This paper addresses the problem of mapping multiple LSTM models to a device by introducing a framework that alters their computational structure, opening opportunities for co-optimising the memory requirements to the target architecture. Targeting an FPGA device, the proposed framework achieves 3× to 5× improved performance over state-of-the-art approaches for the same accuracy loss, opening the path for the deployment of high-performance systems for Hierarchical and Stacked LSTM models.
  •  
33.
  • Ribes, Stefano, 1992, et al. (author)
  • Reliability Analysis of Compressed CNNs
  • 2021
  • Report (other academic/artistic) abstract
    • The use of artificial intelligence, machine learning and in particular Deep Learning (DL) has recently become an effective, de-facto standard solution for complex problems like image classification, sentiment analysis and natural language processing. In order to address the growing performance demands of ML applications, research has focused on techniques for compressing the large number of parameters required by the Deep Neural Networks (DNNs) used in DL. These techniques include parameter pruning, weight sharing (i.e., clustering of the weights) and parameter quantization. However, reducing the number of parameters can lower the fault tolerance of DNNs, which are already sensitive to software and hardware faults caused by, among others, particle strikes, row-hammer or gradient-descent attacks. In this work we analyze the sensitivity to faults of widely used DNNs, in particular Convolutional Neural Networks (CNNs), that have been compressed with the use of pruning, weight clustering and quantization. Our analysis shows that in DNNs that employ all such compression mechanisms, i.e. with their memory footprint reduced up to 86.3x, random single-bit faults can result in accuracy drops of up to 13.56%.
  •  
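The random single-bit fault model used in the report's analysis can be sketched in a few lines. The 8-bit weight layout and the injection site below are assumptions for illustration, not the report's actual methodology:

```python
import random

# Illustrative fault injection: flip one random bit in one 8-bit
# quantized weight and return the faulty copy plus the injection site.

def flip_random_bit(weights, rng):
    """weights: list of unsigned 8-bit integers (quantized parameters).
    Returns (faulty_weights, index, bit) with exactly one bit flipped."""
    i = rng.randrange(len(weights))   # which weight
    b = rng.randrange(8)              # which bit within the weight
    faulty = list(weights)
    faulty[i] ^= 1 << b               # single-bit upset
    return faulty, i, b

rng = random.Random(0)                # seeded for reproducible campaigns
w = [17, 200, 3, 55]
fw, i, b = flip_random_bit(w, rng)
```

In a fault-injection campaign one would repeat this over many trials and measure the resulting accuracy drop; high-order bits of quantized weights typically cause the largest deviations.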
34.
  • Seepers, R.M., et al. (author)
  • Adaptive entity-identifier generation for IMD emergency access
  • 2014
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. - 9781450324847 ; , pp. 41-44
  • Conference paper (peer-reviewed) abstract
    • Recent work on wireless Implantable Medical Devices (IMDs) has revealed the need for secure communication in order to prevent data theft and implant abuse by malicious attackers. However, security should not be provided at the cost of patient safety, and an IMD should thus remain accessible during an emergency regardless of device security. In this paper, we present a novel method of providing IMD emergency access, based on generating Entity Identifiers (EIs) using the Inter-Pulse Intervals (IPIs) of heartbeats. We evaluate the current state of the art in EI generation in terms of security and accessibility for healthy subjects with a wide range of heart rates. Subsequently, we present an adaptive EI-generation algorithm which takes the heart rate into account, maintaining an acceptable emergency-mode activation time (between 5 and 55.4 s) while improving security by up to 3.4x for high heart rates. Finally, we show that activating emergency mode may consume as little as 0.24 μJ from the IMD battery.
  •  
35.
  • Seepers, Robert Mark, et al. (author)
  • Attacks on Heartbeat-Based Security Using Remote Photoplethysmography
  • 2018
  • In: IEEE Journal of Biomedical and Health Informatics. - 2168-2194 .- 2168-2208. ; 22:3, pp. 714-721
  • Journal article (peer-reviewed) abstract
    • The time interval between consecutive heartbeats (inter-pulse interval, IPI) has previously been suggested for securing mobile-health solutions. This time interval is known to contain a degree of randomness, permitting the generation of a time- and person-specific identifier. It is commonly assumed that only devices trusted by a person can make physical contact with him/her, and that this physical contact allows each device to generate a similar identifier based on its own cardiac recordings. Under these conditions, the identifiers generated by different trusted devices can facilitate secure authentication. Recently, a wide range of techniques have been proposed for measuring heartbeats remotely, a prominent example of which is remote photoplethysmography (rPPG). These techniques may pose a significant threat to heartbeat-based security, as an adversary may pretend to be a trusted device by generating a similar identifier without physical contact, thus bypassing one of the core security conditions. In this paper, we assess the feasibility of such remote attacks using state-of-the-art rPPG methods. Our evaluation shows that rPPG has similar accuracy as contact PPG and, thus, forms a substantial threat to heartbeat-based security systems that permit trusted devices to obtain their identifiers from contact PPG recordings. Conversely, rPPG cannot obtain an accurate representation of an identifier generated from electrical cardiac signals, making the latter invulnerable to state-of-the-art remote attacks.
  •  
36.
  • Seepers, R.M., et al. (author)
  • Enhancing heart-beat-based security for mHealth applications
  • 2017
  • In: IEEE Journal of Biomedical and Health Informatics. - 2168-2194 .- 2168-2208. ; 21:1, pp. 254-262
  • Journal article (peer-reviewed) abstract
    • In heart-beat-based security, a security key is derived from the time difference between two consecutive heart beats (the Inter-Pulse Interval, IPI), which may subsequently be used to enable secure communication. While heart-beat-based security holds promise in mobile-health (mHealth) applications, there currently exists no work that provides a detailed characterization of the delivered security in a real system. In this paper, we evaluate the strength of IPI-based security keys in the context of entity authentication. We investigate several aspects which should be considered in practice, including subjects with reduced heart-rate variability, different sensor-sampling frequencies, inter-sensor variability (i.e., how accurately each entity may measure heart beats), as well as average and worst-case authentication time. Contrary to the current state of the art, our evaluation demonstrates that authentication using multiple, less-entropic keys may actually increase the key strength by reducing the effects of inter-sensor variability. Moreover, we find that the maximal key strength of a 60-bit key varies between 29.2 bits and only 5.7 bits, depending on the subject's heart-rate variability. To improve security, we introduce the Inter-multi-Pulse Interval (ImPI), a novel method of extracting entropy from the heart by considering the time difference between two non-consecutive heart beats. Given the same authentication time, using the ImPI for key generation increases key strength by up to 3.4x (+19.2 bits) for subjects with limited heart-rate variability, at the cost of an extended key-generation time of 4.8x (+45 sec).
  •  
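The ImPI introduced above generalizes the IPI by taking time differences between non-consecutive heart beats. A minimal sketch, assuming beats are given as timestamps (the stride parameter `k` is an illustrative reading of "non-consecutive", not the paper's exact construction):

```python
# IPI: intervals between consecutive beats.
# ImPI (sketch): intervals between beats k positions apart, so each
# interval accumulates the variability of k consecutive beats.

def ipis(beat_times):
    """Consecutive inter-pulse intervals from beat timestamps."""
    return [b - a for a, b in zip(beat_times, beat_times[1:])]

def impis(beat_times, k):
    """Intervals between beats k apart; k=1 reduces to the plain IPI."""
    return [beat_times[i + k] - beat_times[i]
            for i in range(len(beat_times) - k)]
```

Larger `k` yields fewer intervals per unit time (hence the longer key-generation time reported) but each interval carries more accumulated variability, which is the entropy gain the paper measures.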
37.
  • Seepers, R.M., et al. (author)
  • On using a von Neumann extractor in heart-beat-based security
  • 2015
  • In: Proceedings - 14th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2015, Helsinki, Finland, 20-22 August 2015. - 9781467379519 ; 1, pp. 491-498
  • Conference paper (peer-reviewed) abstract
    • The Inter-Pulse-Interval (IPI) of heart beats has previously been suggested for facilitating security in mobile health (mHealth) applications. In heart-beat-based security, a security key is derived from the time difference between consecutive heart beats. As two entities that simultaneously sample the same heart beats may generate the same key (with some inter-key disparity), these keys may be used for various security functions, such as entity authentication or data confidentiality. One of the key limitations in heart-beat-based security is the low randomness intrinsic to the most-significant bits (MSBs) in the digital representation of each IPI. In this paper, we explore the use of a von Neumann entropy extractor on these MSBs in order to increase their randomness. We show that our von Neumann key-generator produces significantly more random bits than a non-extracting key generator with an average bit-extraction rate between 13.4% and 21.9%. Despite this increase in randomness, we also find a substantial increase in inter-key disparity, increasing the mismatch tolerance required for a given true-key pair. Accordingly, the maximum-attainable effective key-strength of our key generator is only slightly higher than that of a non-extracting generator (16.4 bits compared to 15.2 bits of security for a 60-bit key), while the former requires an increase in average key-generation time of 2.5x.
  •  
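The von Neumann extractor applied to the IPI bits is the textbook construction: read the bit stream in pairs, emit one bit per unequal pair, and discard equal pairs. A minimal sketch (how the paper pairs the MSBs across IPIs is not reproduced here):

```python
# Classic von Neumann extractor: for each non-overlapping pair,
# "01" -> 0, "10" -> 1, "00"/"11" -> discarded. For an input with
# bias p, the output bits are unbiased but fewer in number, which
# matches the paper's observed bit-extraction rate well below 50%.

def von_neumann_extract(bits):
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)  # first bit of the pair decides the output
    return out
```

The discard rate is why key generation slows down: the more biased the MSBs, the more pairs are thrown away per extracted key bit.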
38.
  • Seepers, R.M., et al. (author)
  • Peak misdetection in heart-beat-based security: Characterization and tolerance
  • 2014
  • In: 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2014; Chicago; United States; 26 August 2014 through 30 August 2014. - 9781424479290 ; , pp. 5401-5405
  • Conference paper (peer-reviewed) abstract
    • The Inter-Pulse-Interval (IPI) of heart beats has previously been suggested for security in mobile health (mHealth) applications. In IPI-based security, secure communication is facilitated through a security key derived from the time difference between heart beats. However, there currently exists no work which considers the effect on security of imperfect heart-beat (peak) detection. This is a crucial aspect of IPI-based security and likely to happen in a real system. In this paper, we evaluate the effects of peak misdetection on the security performance of IPI-based security. It is shown that even with a high peak detection rate between 99.9% and 99.0%, a significant drop in security performance may be observed (between -70% and -303%) compared to having perfect peak detection. We show that authenticating using smaller keys yields both stronger keys as well as potentially faster authentication in case of imperfect heart beat detection. Finally, we present an algorithm which tolerates the effect of a single misdetected peak and increases the security performance by up to 155%.
  •  
39.
  • Seepers, R.M., et al. (author)
  • Secure key-exchange protocol for implants using heartbeats
  • 2016
  • In: ACM International Conference on Computing Frontiers, CF 2016; Como; Italy; 16 May 2016 through 18 May 2016. - New York, NY, USA : ACM. - 9781450341288 ; , pp. 119-126
  • Conference paper (peer-reviewed) abstract
    • The cardiac interpulse interval (IPI) has recently been proposed to facilitate key exchange for implantable medical devices (IMDs) using a patient's own heartbeats as a source of trust. While this form of key exchange holds promise for IMD security, its feasibility is not fully understood due to the simplified approaches found in related works. For example, previously proposed protocols have been designed without considering the limited randomness available per IPI, or have overlooked aspects pertinent to a realistic system, such as imperfect heartbeat detection or the energy overheads imposed on an IMD. In this paper, we propose a new IPI-based key-exchange protocol and evaluate its use during medical emergencies. Our protocol employs fuzzy commitment to tolerate the expected disparity between IPIs obtained by an external reader and an IMD, as well as a novel way of tackling heartbeat misdetection through IPI classification. Using our protocol, the expected time for securely exchanging an 80-bit key with high probability (1 - 10^-6) is roughly one minute, while consuming only 88 μJ from an IMD.
  •  
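Fuzzy commitment, which the protocol above uses to tolerate the disparity between the reader's and the implant's IPIs, can be sketched with a toy repetition code standing in for a real error-correcting code (the paper's actual code and parameters are not reproduced here):

```python
# Toy fuzzy commitment: the key is encoded into a codeword and XOR-masked
# with the committer's biometric witness bits. A verifier whose witness
# differs in few bits recovers a noisy codeword that the error-correcting
# decoder (here: 3x repetition with majority vote) maps back to the key.

def rep_encode(bits, n=3):
    return [b for b in bits for _ in range(n)]

def rep_decode(bits, n=3):
    return [int(sum(bits[i:i + n]) > n // 2)
            for i in range(0, len(bits), n)]

def commit(key_bits, witness_bits):
    """XOR the codeword of the key with the witness bits."""
    cw = rep_encode(key_bits)
    return [c ^ w for c, w in zip(cw, witness_bits)]

def open_commit(commitment, witness_bits):
    """Recover the key from a witness that is close to the original."""
    return rep_decode([c ^ w for c, w in zip(commitment, witness_bits)])
```

A repetition code tolerates fewer errors per key bit than the codes used in practice; it is chosen here only to keep the XOR-mask-then-decode structure visible.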
40.
  • Shafik, R.A., et al. (author)
  • Software modification aided transient error tolerance for embedded systems
  • 2013
  • In: Proceedings - 16th Euromicro Conference on Digital System Design, DSD 2013. - 9780769550749 ; , pp. 219-226
  • Conference paper (peer-reviewed) abstract
    • Commercial off-the-shelf (COTS) components are increasingly being employed in embedded systems due to their high performance at low cost. With emerging reliability requirements, designing these components using traditional hardware redundancy incurs large overheads and time-demanding re-design and validation. To reduce design time under shorter time-to-market requirements, software-only reliable design techniques can provide an effective and low-cost alternative. This paper presents a novel, architecture-independent software modification tool, SMART (Software Modification Aided transient eRror Tolerance), for effective error detection and tolerance. To detect transient errors in the processor data path, control flow and memory at reasonable system overheads, the tool incorporates selective, non-intrusive data duplication and dynamic signature comparison. Also, to mitigate the impact of the detected errors, it facilitates further software modification implementing software-based checkpointing. Due to automatic, source-to-source software modification tailored to a given reliability requirement, the tool requires no re-design effort or hardware- or compiler-level intervention. We evaluate the effectiveness of the tool using a Xentium-processor-based system as a case study of COTS-based systems. Using various benchmark applications with a single-event-upset (SEU) error model, we show that up to 91% of the errors can be detected or masked with reasonable performance, energy and memory-footprint overheads.
  •  
41.
  • Smaragdos, G., et al. (author)
  • A dependable coarse-grain reconfigurable multicore array
  • 2014
  • In: Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS. - 2332-1237. - 9780769552088 ; , pp. 141-150
  • Conference paper (peer-reviewed) abstract
    • Recent trends in semiconductor technology have dictated the constant reduction of device size. One negative effect stemming from the reduced size and increased complexity is reduced device reliability. This paper is centered on permanent fault tolerance and graceful system degradation in the presence of permanent faults. We take advantage of the natural redundancy of homogeneous multicores, following a sparing strategy to reuse functional pipeline stages of faulty cores. This is done by incorporating reconfigurable interconnects next to which the cores of the system are placed, providing the flexibility to redirect the data flow from the faulty pipeline stages of damaged cores to spare, (still) functional ones. Several micro-architectural changes are introduced to decouple the processor stages and allow them to be interchangeable. The proposed approach is a clear departure from previous ones, offering full flexibility as well as highly graceful performance degradation at reasonable cost. More specifically, our coarse-grain fault-tolerant multicore array provides up to ×4 better availability compared to a conventional multicore and up to ×2 higher probability of delivering at least one functioning core at high fault densities. For our benchmarks, our design (synthesized for STM 65nm SP technology) incurs a total execution-time overhead for the complete system ranging from ×1.37 to ×3.3 compared to a (baseline) non-fault-tolerant system, depending on the permanent-fault density. The area overhead is 19.5% and the energy consumption, without incorporating any power/energy-saving technique, is estimated on average to be 20.9% higher compared to the baseline, unprotected design.
  •  
42.
  • Smaragdos, G., et al. (author)
  • BrainFrame: a node-level heterogeneous accelerator platform for neuron simulations
  • 2017
  • In: Journal of Neural Engineering. - : IOP Publishing. - 1741-2560 .- 1741-2552. ; 14:6
  • Journal article (peer-reviewed) abstract
    • Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit for a homogeneous acceleration platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies, an Intel Xeon-Phi CPU, a NVidia GP-GPU and a Maxeler Dataflow Engine. The PyNN software framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different experiment instances of a state-of-the-art neuron model, representing the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity densities, which can drastically affect the workload's performance characteristics. Main results: The combined use of different HPC fabrics demonstrated that BrainFrame is better able to cope with the modeling diversity encountered in realistic experiments. Our performance analysis shows clearly that the model directly affects performance and all three technologies are required to cope with all the model use cases. Significance: The BrainFrame framework is designed to transparently configure and select the appropriate back-end accelerator technology for use per simulation run. The PyNN integration provides a familiar bridge to the vast number of models already available. Additionally, it gives a clear roadmap for extending the platform support beyond the proof of concept, with improved usability and directly useful features to the computational-neuroscience community, paving the way for wider adoption.
  •  
43.
  • Smaragdos, G., et al. (author)
  • FPGA-based biophysically-meaningful modeling of olivocerebellar neurons
  • 2014
  • In: 2014 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA 2014; Monterey, CA; United States; 26 February 2014 through 28 February 2014. - New York, NY, USA : ACM. - 9781450326711 ; , pp. 89-98
  • Conference paper (peer-reviewed) abstract
    • The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION-cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was then translated to HLS C code for the Xilinx Vivado toolflow, and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of 700 compared to the original Matlab code and 12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of 45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.
  •  
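The biophysically plausible models these entries accelerate are conductance-based: each compartment's membrane voltage follows an ODE driven by ionic conductances, which the hardware integrates step by step. As a rough, purely illustrative sketch of that numerical pattern (a single leaky compartment with invented parameters and forward-Euler integration, not the authors' three-compartment ION model):

```python
# Forward-Euler integration of one leaky membrane compartment.
# Illustrative only: the ION model in the paper couples three compartments
# per cell and several voltage-gated conductances; all parameters here
# are invented for exposition.

C_M    = 1.0    # membrane capacitance (uF/cm^2), assumed
G_LEAK = 0.3    # leak conductance (mS/cm^2), assumed
E_LEAK = -65.0  # leak reversal potential (mV), assumed

def step(v, i_ext, dt=0.025):
    """One Euler step of C * dV/dt = -g_leak * (V - E_leak) + I_ext."""
    dv = (-G_LEAK * (v - E_LEAK) + i_ext) / C_M
    return v + dt * dv

v = -65.0
for _ in range(4000):      # 100 ms of simulated time at dt = 0.025 ms
    v = step(v, i_ext=1.0)
print(round(v, 2))         # settles near E_LEAK + I/g = -65 + 1/0.3, i.e. about -61.67
```

The FPGA designs above evaluate exactly this kind of update for every cell of the network in parallel each time step, which is why throughput scales with network size until on-chip memory runs out.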
44.
  • Smaragdos, G., et al. (author)
  • Performance Analysis of Accelerated Biophysically-Meaningful Neuron Simulations
  • 2016
  • In: 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). - 9781509019533 ; , pp. 1-11
  • Conference paper (peer-reviewed) abstract
    • In-vivo and in-vitro experiments are routinely used in neuroscience to unravel brain functionality. Although they are a powerful experimentation tool, they are also time-consuming and often restrictive. Computational neuroscience attempts to solve this by using biologically-plausible and biophysically-meaningful neuron models, most prominent among which are the conductance-based models. Their computational complexity calls for accelerator-based computing to mount large-scale or real-time neuroscientific experiments. In this paper, we analyze and draw conclusions on the class of conductance models by using a representative modeling application of the inferior olive (InfOli), an important part of the olivocerebellar brain circuit. We conduct an extensive profiling session to identify the computational and data-transfer requirements of the application under various realistic use cases. The application is then ported onto two acceleration nodes, an Intel Xeon Phi and a Maxeler Vectis Data Flow Engine (DFE). We evaluate the performance scalability and resource requirements of the InfOli application on the two target platforms. The analysis of InfOli, which is a real-life neuroscientific application, can serve as a useful guide for porting a wide range of similar workloads onto platforms like the Xeon Phi or the Maxeler DFEs. As accelerators are increasingly populating High-Performance Computing (HPC) infrastructure, the current paper provides useful insight into how to optimally use such nodes to run complex and relevant neuron-modeling workloads.
  •  
45.
  • Smaragdos, G., et al. (author)
  • Real-time olivary neuron simulations on dataflow computing machines
  • 2014
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Cham : Springer International Publishing. - 1611-3349 .- 0302-9743. - 9783319075174 ; 8488, pp. 487-497
  • Conference paper (peer-reviewed) abstract
    • The Inferior-Olivary nucleus (ION) is a well-charted brain region, heavily associated with the sensorimotor control of the body. It comprises neural cells with unique properties which facilitate sensory processing and motor-learning skills. Simulations of such neurons rapidly become intractable when biophysically plausible models and meaningful network sizes (at least in the order of some hundreds of cells) are modeled. To overcome this problem, we accelerate a highly detailed ION network model using a Maxeler Dataflow Computing Machine. The design simulates a 330-cell network at real-time speed and achieves maximum throughputs of 24.7 GFLOPS. The Maxeler machine, integrating a Virtex-6 FPGA, yields speedups of x92-102 compared to a reference C implementation running on an Intel Xeon 2.66GHz, and x2-8 compared to a pure Virtex-7 FPGA implementation.
  •  
46.
  • Sourdis, Ioannis, 1979, et al. (author)
  • DeSyRe: On-demand adaptive and reconfigurable fault-tolerant SoCs
  • 2014
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Cham : Springer International Publishing. - 1611-3349 .- 0302-9743. ; 8405, pp. 312-317
  • Conference paper (peer-reviewed) abstract
    • The DeSyRe project builds on-demand adaptive, reliable Systems-on-Chips. In response to current semiconductor technology trends that make chips less reliable, DeSyRe describes a new generation of by-design reliable systems at a reduced power and performance cost. This is achieved through the following main contributions. DeSyRe defines a fault-tolerant system architecture built out of unreliable components, rather than aiming at totally fault-free and hence more costly chips. In addition, DeSyRe systems are on-demand adaptive to various types and densities of faults, as well as to other system constraints and application requirements. For leveraging on-demand adaptation/customization and reliability at reduced cost, a new dynamically reconfigurable substrate is designed and combined with runtime system software support. The above define a generic and repeatable design framework, which is applied to two medical SoCs with high reliability constraints and diverse performance and power requirements. One of the main goals of the DeSyRe project is to increase the availability of SoC components in the presence of permanent faults, caused at manufacturing time or due to device aging. A mix of coarse- and fine-grain reconfigurable hardware substrate is designed to isolate and bypass faulty component parts. The flexibility provided by the DeSyRe reconfigurable substrate is exploited at runtime by system optimization heuristics, which decide to modify component configuration when a permanent fault is detected, providing graceful degradation.
  •  
47.
  • Sourdis, Ioannis, 1979, et al. (author)
  • DeSyRe: On-demand system reliability
  • 2013
  • In: Microprocessors and Microsystems. - : Elsevier BV. - 0141-9331. ; 37:8, pp. 981-1001
  • Journal article (peer-reviewed) abstract
    • The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-/fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system's performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.
  •  
48.
  •  
49.
  • Sourdis, Ioannis, 1979, et al. (author)
  • HiPEAC: Upcoming Challenges in Reconfigurable Computing
  • 2011
  • In: Springer Science+Business Media. - New York, NY : Springer New York. - 9781461400608 ; , pp. 35-52
  • Book chapter (other academic/artistic) abstract
    • New developments in semiconductor technology cause significant problems in chips' performance, power consumption, and reliability, indicating that the "golden" CMOS era is long gone. Technology scaling no longer delivers significant performance speedups, the increasing power density poses severe limitations on chips, and transistors are becoming less reliable. These trends pose a great challenge for reconfigurable computing: to answer the performance, power-efficiency, and reliability quest raised by current technology trends. Reconfigurable computing has the potential to achieve such a goal; however, several improvements are required first. In this chapter, we discuss a number of issues which need to be addressed in order to make reconfigurable computing a widely used solution for future systems.
  •  
50.
  • Sourdis, Ioannis, 1979, et al. (author)
  • Longest prefix match and updates in range tries
  • 2011
  • In: Proceedings - 22nd IEEE International Conference on Application-Specific Systems, Architectures and Processors, Santa Monica, 11-14 September 2011. - 1063-6862. - 9781457712920 ; , pp. 51-58
  • Conference paper (peer-reviewed) abstract
    • In this paper, we describe an IP-lookup method for network routing. We extend the basic Range Trie data structure to support Longest Prefix Match (LPM) and incremental updates. Range Tries improve on existing Range Trees by allowing comparisons shorter than the address width; in so doing, their lookup latency and memory requirements scale better with the wider upcoming IPv6 addresses. However, as in Range Trees, a Range Trie does not inherently support LPM, while incremental updates have a performance and memory overhead. We describe the additions required to the basic Range Trie structure and its hardware design in order to store and dynamically update prefixes for supporting LPM. The proposed approach is prototyped on a Virtex-4 FPGA and synthesized for 90-nm ASICs. The Range Trie is evaluated using Internet routing tables and traces of updates. Supporting LPM roughly doubles the memory size of the basic Range Trie, which is still half that of the second-best related work. The proposed design performs one lookup per cycle and one prefix update every four cycles.
  •  
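The Longest Prefix Match semantics that the entry above adds to Range Tries can be illustrated with a plain binary bit-trie: walk the trie along the address bits and remember the last next-hop seen. This is a minimal expository sketch with invented names and example prefixes, not the paper's range-comparison hardware design (which compares address ranges using comparators narrower than the address width):

```python
# Plain binary-trie longest prefix match, for exposition only.
# A Range Trie replaces this bit-by-bit walk with range comparisons
# shorter than the address width; that design is not modeled here.

class TrieNode:
    def __init__(self):
        self.children = {}    # maps '0'/'1' to child TrieNode
        self.next_hop = None  # set if a routing prefix ends at this node

def insert(root, prefix_bits, next_hop):
    """Insert a routing prefix given as a bit string, e.g. '1011'."""
    node = root
    for b in prefix_bits:
        node = node.children.setdefault(b, TrieNode())
    node.next_hop = next_hop

def longest_prefix_match(root, addr_bits):
    """Walk along the address bits, keeping the deepest next-hop seen."""
    node, best = root, root.next_hop
    for b in addr_bits:
        node = node.children.get(b)
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, '10', 'A')    # prefix 10*   -> next hop A
insert(root, '1011', 'B')  # prefix 1011* -> next hop B
print(longest_prefix_match(root, '10110111'))  # -> 'B' (deeper match wins)
print(longest_prefix_match(root, '10010111'))  # -> 'A'
```

The "remember the deepest match" step is exactly what the abstract means by a structure not inherently supporting LPM: the base Range Trie resolves an address to one range, and extra prefix bookkeeping must be stored alongside the ranges to recover the longest matching prefix.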
  • Results 1-50 of 66
Publication type
conference papers (44)
journal articles (18)
reports (3)
book chapters (1)
Type of content
peer-reviewed (59)
other academic/artistic (7)
Author/editor
Sourdis, Ioannis, 19 ... (66)
Strydis, C. (21)
Petersen Moura Tranc ... (13)
Papaefstathiou, Vasi ... (10)
Ejaz, Ahsen, 1986 (8)
Papaefstathiou, Ioan ... (3)
Soudris, D. (3)
Gulisano, Vincenzo M ... (2)
Strikos, Panagiotis, ... (2)
Pericas, Miquel, 197 ... (2)
Thomson, J. (2)
Arelakis, Angelos, 1 ... (2)
Athanasopoulos, E. (2)
Boehner, M. (2)
Giuffrida, C. (2)
Pidan, D. (2)
Prevelakis, V. (2)
Peris-Lopez, P. (2)
Sun, L. (1)
Davies, C (1)
Alvarez, Lluc (1)
Ruiz, Abraham (1)
Bigas-Soldevilla, Ar ... (1)
Kuroedov, Pavel (1)
Gonzalez, Alberto (1)
Mahale, Hamsika (1)
Bustamante, Noe (1)
Aguilera, Albert (1)
Minervini, Francesco (1)
Salamero, Javier (1)
Palomar, Oscar (1)
Psathakis, A. (1)
Dimou, Nikolaos (1)
Giaourtas, Michalis (1)
Mastorakis, Iasonas (1)
Ieronymakis, Georgio ... (1)
Matzouranis, Georgio ... (1)
Flouris, Vasilis (1)
Kossifidis, Nick (1)
Marazakis, Manolis (1)
Goel, Bhavishya, 198 ... (1)
Manivannan, Madhavan ... (1)
Vázquez Maceiras, Ma ... (1)
Stenström, Per, 1957 (1)
Hagemeyer, Jens (1)
Tigges, L. (1)
Kucza, Nils (1)
Philippe, Jean Marc (1)
Ioannidis, S. (1)
Alvarez, Carlos (1)
Higher education institution
Chalmers tekniska högskola (66)
Language
English (66)
Research subject (UKÄ/SCB)
Natural sciences (55)
Engineering (31)
Medicine and health sciences (2)

Year
