SwePub
Search the SwePub database


Result list for the search "WFRF:(Sourdis Ioannis 1979) srt2:(2020-2023)"

Search: WFRF:(Sourdis Ioannis 1979) > (2020-2023)

  • Results 1-10 of 15
1.
  • Alvarez, Lluc, et al. (authors)
  • eProcessor: European, Extendable, Energy-Efficient, Extreme-Scale, Extensible, Processor Ecosystem
  • 2023
  • In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023, pp. 309-314
  • Conference paper (peer-reviewed), abstract:
    • The eProcessor project aims at creating a RISC-V full stack ecosystem. The eProcessor architecture combines a high-performance out-of-order core with energy-efficient accelerators for vector processing and artificial intelligence with reduced-precision functional units. The design of this architecture follows a hardware/software co-design approach with relevant application use cases from the high-performance computing, bioinformatics and artificial intelligence domains. Two eProcessor prototypes will be developed based on two fabricated eProcessor ASICs integrated into a computer-on-module.
2.
  • Pericas, Miquel, 1979, et al. (authors)
  • Preface
  • 2022
  • In: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors. ISSN 1063-6862. 2022-July, p. IX-
  • Conference paper (other academic/artistic)
3.
  • Ejaz, Ahsen, 1986, et al. (authors)
  • FastTrackNoC: A DDR NoC with FastTrack Router Datapaths
  • 2021
  • Report (other academic/artistic), abstract:
    • This paper introduces FastTrackNoC, a Network-on-Chip router architecture that reduces latency by bypassing its switch traversal (ST) stage. FastTrackNoC adds a fast-track path between the head of a particular virtual channel (VC) buffer at each input port and the link of the opposite output. This allows non-turning flits to bypass ST when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely, pipeline bypassing of control stages, precomputed routing, and lookahead control signaling, to allow at best a flit to proceed directly to link traversal (LT). FastTrackNoC is applied to a Dual Data Rate (DDR) router in order to maximize throughput. Post-place-and-route results in 28 nm technology show that (i) compared to current state-of-the-art DDR NoCs, FastTrackNoC offers the same throughput and reduces average packet latency by 11-32% while requiring up to 5% more power, and (ii) compared to current state-of-the-art Single Data Rate (SDR) NoCs, FastTrackNoC reduces packet latency by 9-40% and achieves 16-19% higher throughput with 5% higher power at the SDR NoC saturation point.
4.
  • Ejaz, Ahsen, 1986, et al. (authors)
  • FastTrackNoC: A NoC with FastTrack Router Datapaths
  • 2022
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. ISSN 1530-0897. 2022-April, pp. 971-985
  • Conference paper (peer-reviewed), abstract:
    • This paper introduces FastTrackNoC, a Network-on-Chip (NoC) router architecture that reduces packet latency by bypassing its switch traversal (ST) stage. It is based on the observation that there is a bias in the direction a flit takes through a router, e.g., in a 2D mesh network, non-turning hops are preferred, especially when dimension-order routing is used. FastTrackNoC capitalizes on this observation and adds to a 2D mesh router a fast-track path between the head of a single input virtual channel (VC) buffer and its most popular, opposite output. This allows non-turning flits to bypass ST logic, i.e., buffer, input, and output multiplexing, when the required router resources are available. FastTrackNoC combines ST bypassing with existing techniques for reducing latency, namely, allocation bypassing, precomputed routing, and lookahead control signaling, to allow at best incoming flits to proceed directly to link traversal (LT). Moreover, it is applied to a Dual Data Rate (DDR) router in order to maximize network throughput. Post-place-and-route results in 28 nm show the following: compared to previous DDR NoCs, FastTrackNoC offers 13-32% lower average packet latency; compared to previous multi-VC Single Data Rate (SDR) NoCs, FastTrackNoC reduces latency by 10-40% and achieves 18-21% higher throughput; and compared to single-channel SDR NoCs, it offers up to 50% higher throughput and similar latency.
5.
  • Ejaz, Ahsen, 1986, et al. (authors)
  • HighwayNoC: Approaching Ideal NoC Performance With Dual Data Rate Routers
  • 2021
  • In: IEEE/ACM Transactions on Networking. ISSN 1558-2566, 1063-6692. 29:1, pp. 318-331
  • Journal article (peer-reviewed), abstract:
    • This paper describes HighwayNoC, a Network-on-Chip (NoC) that approaches ideal network performance using a Dual Data Rate (DDR) datapath. Based on the observation that a router's datapath is faster than its control, a DDR NoC allows flits to be routed at DDR, improving throughput to rates defined solely by the datapath rather than by the control. DDR NoCs can use pipeline bypassing to reduce packet latency at low traffic load. However, existing DDR routers offer bypassing only on in-network, non-turning hops to simplify the required logic. HighwayNoC extends the bypassing support of DDR routers to local ports, allowing flits to enter and exit the network faster. Moreover, it simplifies the DDR switch allocation and the interface of router ports, reducing power and area costs. Post-place-and-route results in 28 nm technology show that HighwayNoC performs better than current state-of-the-art NoCs. Compared to previous DDR NoCs, HighwayNoC reduces average packet latency by 7.3-27% and power consumption by 1-10%, without affecting throughput. Compared to existing Single Data Rate NoCs, HighwayNoC achieves 17-22% higher throughput, has similar or up to 13.8% lower packet latency, and mixed energy efficiency results.
6.
  • Eldstål-Ahrens, Albin, 1988, et al. (authors)
  • FlatPack: Flexible Compaction of Compressed Memory
  • 2022
  • In: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT. New York, NY, USA: ACM. ISSN 1089-795X. pp. 96-108
  • Conference paper (peer-reviewed), abstract:
    • The capacity and bandwidth of main memory is an increasingly important factor in computer system performance. Memory compression and compaction have been combined to increase effective capacity and reduce costly page faults. However, existing systems typically maintain compaction at the expense of bandwidth. One major cause of extra traffic in such systems is page overflows, which occur when data compressibility degrades and compressed pages must be reorganized. This paper introduces FlatPack, a novel approach to memory compaction which is able to mitigate this overhead by reorganizing compressed data dynamically with less data movement. Reorganization is carried out by an addition to the memory controller, without intervention from software. FlatPack is able to maintain memory capacity competitive with current state-of-the-art memory compression designs, while reducing mean memory traffic by up to 67%. This yields average improvements in performance and total system energy consumption over existing memory compression solutions of 31-46% and 11-25%, respectively. In total, FlatPack improves on baseline performance and energy consumption by 108% and 40%, respectively, in a single-core system, and 83% and 23%, respectively, in a multi-core system.
7.
  • Eldstål-Ahrens, Albin, 1988, et al. (authors)
  • L2C: Combining Lossy and Lossless Compression on Memory and I/O
  • 2022
  • In: Transactions on Embedded Computing Systems. Association for Computing Machinery (ACM). ISSN 1558-3465, 1539-9087. 21:1
  • Journal article (peer-reviewed), abstract:
    • In this paper, we introduce L2C, a hybrid lossy/lossless compression scheme applicable both to the memory subsystem and the I/O traffic of a processor chip. L2C employs general-purpose lossless compression and combines it with state-of-the-art lossy compression to achieve compression ratios up to 16:1 and improve the utilization of the chip's bandwidth resources. Compressing memory traffic yields lower memory access time, improving system performance and energy efficiency. Compressing I/O traffic offers several benefits for resource-constrained systems, including more efficient storage and networking. We evaluate L2C as a memory compressor in simulation with a set of approximation-tolerant applications. L2C improves baseline execution time by an average of 50%, and total system energy consumption by 16%. Compared to the current state-of-the-art lossy and lossless memory compression approaches, L2C improves execution time by 9% and 26%, respectively, and reduces system energy costs by 3% and 5%, respectively. I/O compression efficacy is evaluated using a set of real-life datasets. L2C achieves compression ratios of up to 10.4:1 for a single dataset and on average about 4:1, while introducing no more than 0.4% error.
8.
  • Eldstål-Ahrens, Albin, 1988, et al. (authors)
  • MemSZ: Squeezing Memory Traffic with Lossy Compression
  • 2020
  • In: Transactions on Architecture and Code Optimization. Association for Computing Machinery (ACM). ISSN 1544-3973, 1544-3566. 17:4
  • Journal article (peer-reviewed), abstract:
    • This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25% introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.
9.
  • Martorell, Xavier, et al. (authors)
  • Introduction to the Special Section on FPL 2019
  • 2021
  • In: ACM Transactions on Reconfigurable Technology and Systems. Association for Computing Machinery (ACM). ISSN 1936-7414, 1936-7406. 14:2
  • Journal article (other academic/artistic)
10.
  • Ramakrishnan Geethakumari, Prajith, 1986, et al. (authors)
  • A Specialized Memory Hierarchy for Stream Aggregation
  • 2021
  • In: 2021 31st International Conference on Field-Programmable Logic and Applications (FPL 2021). ISSN 1946-1488. ISBN 9781665437592. pp. 204-210
  • Conference paper (peer-reviewed), abstract:
    • High throughput stream aggregation is essential for many applications that analyze massive volumes of data. Incoming data need to be stored in a sliding window before processing, in case the aggregation functions cannot be computed incrementally. However, this puts tremendous pressure on the memory bandwidth and capacity. GPU and CPU memory management is inefficient for this task as it introduces unnecessary data movement that wastes bandwidth. FPGAs can make more efficient use of their memory but existing approaches employ either only on-chip memory (i.e. SRAM) or only off-chip memory (i.e. DRAM) to store the aggregated values. The high on-chip SRAM bandwidth enables line-rate processing, but only for small problem sizes due to the limited capacity. The larger off-chip DRAM size supports larger problems, but falls short on performance due to lower bandwidth. This paper introduces a specialized memory hierarchy for stream aggregation. It employs multiple memory levels with different characteristics to offer both high bandwidth and capacity. In doing so, larger stream aggregation problems can be supported, i.e. large number of concurrently active keys and large sliding windows, at line-rate performance. A 3-level implementation of the proposed memory hierarchy is used in a reconfigurable stream aggregation dataflow engine (DFE), outperforming existing competing solutions. Compared to designs with only on-chip memory, our approach supports 4 orders of magnitude larger problems. Compared to designs that use only DRAM, our design achieves up to 8x higher throughput.
