SwePub
Search the SwePub database

  Advanced search

Result list for search "WFRF:(Papaefstathiou Vasileios 1980)"

Search: WFRF:(Papaefstathiou Vasileios 1980)

  • Results 1-10 of 23
Sort/group the result list

Numbering | Reference | Cover image | Find
1.
  • Alvarez, Lluc, et al. (author)
  • eProcessor: European, Extendable, Energy-Efficient, Extreme-Scale, Extensible, Processor Ecosystem
  • 2023
  • In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023, pp. 309-314
  • Conference paper (peer-reviewed), abstract:
    • The eProcessor project aims at creating a RISC-V full stack ecosystem. The eProcessor architecture combines a high-performance out-of-order core with energy-efficient accelerators for vector processing and artificial intelligence with reduced-precision functional units. The design of this architecture follows a hardware/software co-design approach with relevant application use cases from the high-performance computing, bioinformatics and artificial intelligence domains. Two eProcessor prototypes will be developed based on two fabricated eProcessor ASICs integrated into a computer-on-module.
  •  
2.
  • Ioannou, Aggelos D., et al. (author)
  • UNILOGIC: A Novel Architecture for Highly Parallel Reconfigurable Systems
  • 2020
  • In: ACM Transactions on Reconfigurable Technology and Systems. Association for Computing Machinery (ACM). ISSN 1936-7414, 1936-7406; 13:4
  • Journal article (peer-reviewed), abstract:
    • One of the main characteristics of High-performance Computing (HPC) applications is that they become increasingly performance and power demanding, pushing HPC systems to their limits. Existing HPC systems have not yet reached exascale performance mainly due to power limitations. Extrapolating from today's top HPC systems, about 100-200 MWatts would be required to sustain an exaflop-level of performance. A promising solution for tackling power limitations is the deployment of energy-efficient reconfigurable resources (in the form of Field-programmable Gate Arrays (FPGAs)) tightly integrated with conventional CPUs. However, current FPGA tools and programming environments are optimized for accelerating a single application or even task on a single FPGA device. In this work, we present UNILOGIC (Unified Logic), a novel HPC-tailored parallel architecture that efficiently incorporates FPGAs. UNILOGIC adopts the Partitioned Global Address Space (PGAS) model and extends it to include hardware accelerators, i.e., tasks implemented on the reconfigurable resources. The main advantages of UNILOGIC are that (i) the hardware accelerators can be accessed directly by any processor in the system, and (ii) the hardware accelerators can access any memory location in the system. In this way, the proposed architecture offers a unified environment where all the reconfigurable resources can be seamlessly used by any processor/operating system. The UNILOGIC architecture also provides hardware virtualization of the reconfigurable logic so that the hardware accelerators can be shared among multiple applications or tasks. 
The FPGA layer of the architecture is implemented by splitting its reconfigurable resources into (i) a static partition, which provides the PGAS-related communication infrastructure, and (ii) fixed-size, dynamically reconfigurable slots that can be programmed and accessed independently or combined to support both fine- and coarse-grain reconfiguration. Finally, the UNILOGIC architecture has been evaluated on a custom prototype that consists of two 1U chassis, each of which includes eight interconnected daughter boards, called Quad-FPGA Daughter Boards (QFDBs); each QFDB carries four tightly coupled Xilinx Zynq UltraScale+ MPSoCs as well as 64 Gigabytes of DDR4 memory, so the prototype features a total of 64 Zynq MPSoCs and 1 Terabyte of memory. We tuned and evaluated the UNILOGIC prototype using both low-level (bare-metal) performance tests and two popular real-world HPC applications, one compute-intensive and one data-intensive. Our evaluation shows that UNILOGIC offers impressive performance, ranging from 2.5 to 400 times faster and 46 to 300 times more energy efficient compared to conventional parallel systems utilizing only high-end CPUs, while it also outperforms GPUs by a factor of 3 to 6 in terms of time to solution, and of 10 to 20 in terms of energy to solution.
  •  
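The PGAS organization described in the abstract above can be illustrated with a minimal address-decoding sketch. The field widths and function names here are hypothetical assumptions for illustration, not taken from the UNILOGIC design: a global address carries a partition (node) ID plus a local offset, which is how any processor can reference memory or a hardware accelerator on any node.

```python
# Illustrative PGAS address split (assumed layout, not UNILOGIC's actual one):
# a global address = <node_id bits> <local offset bits>.

PARTITION_BITS = 6   # hypothetical: up to 64 nodes (the prototype has 64 Zynq MPSoCs)
OFFSET_BITS = 34     # hypothetical: 16 GiB of local address space per node

def decode_global_address(gaddr):
    """Split a PGAS global address into (node_id, local_offset)."""
    node_id = gaddr >> OFFSET_BITS
    offset = gaddr & ((1 << OFFSET_BITS) - 1)
    return node_id, offset

def encode_global_address(node_id, offset):
    """Build a global address from a node ID and a local offset."""
    assert node_id < (1 << PARTITION_BITS) and offset < (1 << OFFSET_BITS)
    return (node_id << OFFSET_BITS) | offset
```

Under such a layout, an accelerator mapped into node 5's partition is addressable from any other node by its global address alone.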
3.
  • Mavroidis, Iakovos, et al. (author)
  • ECOSCALE: Reconfigurable computing and runtime system for future exascale systems
  • 2016
  • In: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016, Dresden, Germany, 14-18 March 2016. ISSN 1530-1591. ISBN 9783981537062, pp. 696-701
  • Conference paper (peer-reviewed), abstract:
    • In order to reach exascale performance, current HPC systems need to be improved. Simple hardware scaling is not a feasible solution due to the increasing utility costs and power consumption limitations. Apart from improvements in implementation technology, what is needed is to refine the HPC application development flow as well as the system architecture of future HPC systems. ECOSCALE tackles these challenges by proposing a scalable programming environment and architecture, aiming to substantially reduce energy consumption as well as data traffic and latency. ECOSCALE introduces a novel heterogeneous energy-efficient hierarchical architecture, as well as a hybrid many-core+OpenCL programming environment and runtime system. The ECOSCALE approach is hierarchical and is expected to scale well by partitioning the physical system into multiple independent Workers (i.e. compute nodes). Workers are interconnected in a tree-like fashion and define a contiguous global address space that can be viewed either as a set of partitions in a Partitioned Global Address Space (PGAS), or as a set of nodes hierarchically interconnected via an MPI protocol. To further increase energy efficiency, as well as to provide resilience, the Workers employ reconfigurable accelerators mapped into the virtual address space utilizing a dual stage System Memory Management Unit with coherent memory access. The architecture supports shared partitioned reconfigurable resources accessed by any Worker in a PGAS partition, as well as automated hardware synthesis of these resources from an OpenCL-based programming model.
  •  
4.
  • Azhar, Muhammad Waqar, 1986, et al. (author)
  • SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures
  • 2017
  • In: Transactions on Architecture and Code Optimization. Association for Computing Machinery (ACM). ISSN 1544-3973, 1544-3566; 14:4, Article No. 41
  • Journal article (peer-reviewed), abstract:
    • Most systems allocate computational resources to each executing task without any actual knowledge of the application’s Quality-of-Service (QoS) requirements. Such best-effort policies lead to overprovisioning of the resources and increase energy loss. This work assumes applications with soft QoS requirements and exploits the inherent timing slack to minimize the allocated computational resources to reduce energy consumption. We propose a lightweight progress-tracking methodology based on the outer loops of application kernels. It builds on online history and uses it to estimate the total execution time. The prediction of the execution time and the QoS requirements are then used to schedule the application on a heterogeneous architecture with big out-of-order cores and small (LITTLE) in-order cores and select the minimum operating frequency, using DVFS, that meets the deadline. Our scheme is effective in exploiting the timing slack of each application. We show that it can reduce the energy consumption by more than 20% without missing any computational deadlines.
  •  
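The progress-tracking idea in the SLOOP abstract can be sketched as follows. This is an illustrative simplification under stated assumptions (the function name, the linear runtime-vs-frequency model, and all numbers are hypothetical, not SLOOP's actual mechanism): extrapolate total execution time from completed outer-loop iterations, then pick the lowest DVFS operating point predicted to still meet the soft deadline.

```python
# Sketch: choose the minimum DVFS frequency that meets a soft QoS deadline,
# assuming runtime scales inversely with frequency (a simplification).

def pick_min_frequency(iters_done, iters_total, elapsed_s, deadline_s, freqs_ghz):
    """Return the lowest frequency predicted to meet the deadline."""
    per_iter = elapsed_s / iters_done                  # observed time per outer iteration
    remaining = (iters_total - iters_done) * per_iter  # predicted remaining time at current freq
    budget = deadline_s - elapsed_s                    # timing slack still available
    current = max(freqs_ghz)   # assume history was collected at the highest frequency
    for f in sorted(freqs_ghz):                        # try slowest first
        if remaining * (current / f) <= budget:
            return f
    return current             # no slack left: run flat out
```

For example, with 10 of 100 iterations done after 1 s against a 20 s deadline, the scheme would drop from 2.0 GHz to 1.0 GHz and still finish in time.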
5.
  • Ejaz, Ahsen, 1986, et al. (author)
  • DDRNoC: Dual Data-Rate Network-on-Chip
  • 2018
  • In: Transactions on Architecture and Code Optimization. Association for Computing Machinery (ACM). ISSN 1544-3973, 1544-3566; 15:2
  • Journal article (peer-reviewed), abstract:
    • This article introduces DDRNoC, an on-chip interconnection network capable of routing packets at Dual Data Rate. The cycle time of current 2D-mesh Network-on-Chip routers is limited by their control rather than by the datapath (switch and link traversal), which exhibits significant slack. DDRNoC capitalizes on this observation, allowing two flits per cycle to share the same datapath. Thereby, DDRNoC achieves higher throughput than a Single Data Rate (SDR) network. Alternatively, using lower-voltage circuits, the above slack can be exploited to reduce power consumption while matching the SDR network throughput. In addition, DDRNoC exhibits reduced clock distribution power, improving energy efficiency, as it needs a slower clock than an SDR network that routes packets at the same rate. Post place-and-route results in 28nm technology show that, compared to an iso-voltage (1.1V) SDR network, DDRNoC improves throughput proportionally to the SDR datapath slack. Moreover, a low-voltage (0.95V) DDRNoC implementation converts that slack to power reduction, offering the 1.1V SDR throughput at a substantially lower energy cost.
  •  
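The slack argument in the DDRNoC abstract reduces to simple arithmetic. The delay numbers below are hypothetical, chosen only to illustrate the condition under which two datapath traversals fit in one control-limited clock cycle:

```python
# Sketch: when the router control path sets the cycle time and the datapath
# (switch + link traversal) is much faster, the leftover slack can carry a
# second flit per cycle (Dual Data Rate). All delays are assumed values.

control_delay_ns = 1.0    # hypothetical critical path of the router control
datapath_delay_ns = 0.45  # hypothetical switch + link traversal delay

cycle_ns = control_delay_ns           # the clock is set by the slower control
sdr_flits_per_ns = 1 / cycle_ns       # Single Data Rate: one flit per cycle
# DDR is feasible whenever two datapath traversals fit in one control cycle:
ddr_possible = 2 * datapath_delay_ns <= cycle_ns
ddr_flits_per_ns = (2 if ddr_possible else 1) / cycle_ns
```

With these assumed delays the datapath slack doubles per-link throughput; alternatively, as the abstract notes, the slack can instead absorb the slowdown of a lower supply voltage.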
6.
  • Ejaz, Ahsen, 1986, et al. (author)
  • FreewayNoC: A DDR NoC with Pipeline Bypassing
  • 2018
  • In: 2018 12th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2018. ISBN 9781538648933
  • Conference paper (peer-reviewed), abstract:
    • This paper introduces FreewayNoC, a Network-on-Chip that routes packets at Dual Data Rate (DDR) and allows pipeline bypassing. Based on the observation that a router's datapath is faster than its control, a recent NoC design allowed flits to be routed at DDR, improving throughput to rates defined solely by the switch and link traversal rather than by the control. However, such a DDR NoC suffers from high packet latency as flits require multiple cycles per hop. A common way to reduce latency at low traffic load is pipeline bypassing, whereby flits that find a contention-free way to the output port can directly traverse the switch. Existing Single Data Rate (SDR) NoC routers support it, but applying pipeline bypassing to a DDR router is more challenging: it would need additional bypassing logic, which would add to the cycle time and compromise the DDR NoC throughput advantage. The FreewayNoC design restricts pipeline bypassing on a DDR router to only the flits that go straight, simplifying its logic. Thereby, it offers low packet latency without affecting DDR router cycle time and throughput. Then, at low traffic loads, besides the few turns that a flit takes on its way from source to destination, all other hops can potentially offer minimum latency equal to the delay of the switch and link traversal. Post place-and-route results in 28 nm technology confirm the above and also show that zero-load latency scales with hop count better than in current state-of-the-art NoCs.
  •  
7.
  • Ejaz, Ahsen, 1986, et al. (author)
  • HighwayNoC: Approaching Ideal NoC Performance With Dual Data Rate Routers
  • 2021
  • In: IEEE/ACM Transactions on Networking. ISSN 1558-2566, 1063-6692; 29:1, pp. 318-331
  • Journal article (peer-reviewed), abstract:
    • This paper describes HighwayNoC, a Network-on-Chip (NoC) that approaches ideal network performance using a Dual Data Rate (DDR) datapath. Based on the observation that a router's datapath is faster than its control, a DDR NoC allows flits to be routed at DDR, improving throughput to rates defined solely by the datapath rather than by the control. DDR NoCs can use pipeline bypassing to reduce packet latency at low traffic load. However, existing DDR routers offer bypassing only on in-network, non-turning hops to simplify the required logic. HighwayNoC extends the bypassing support of DDR routers to local ports, allowing flits to enter and exit the network faster. Moreover, it simplifies the DDR switch allocation and the interface of router ports, reducing power and area costs. Post place-and-route results in 28 nm technology show that HighwayNoC performs better than current state-of-the-art NoCs. Compared to previous DDR NoCs, HighwayNoC reduces average packet latency by 7.3-27% and power consumption by 1-10%, without affecting throughput. Compared to existing Single Data Rate NoCs, HighwayNoC achieves 17-22% higher throughput, has similar or up to 13.8% lower packet latency, and shows mixed energy-efficiency results.
  •  
8.
  • Knyaginin, Dmitry, 1983, et al. (author)
  • Adaptive row addressing for cost-efficient parallel memory protocols in large-capacity memories
  • 2016
  • In: MEMSYS 2016: International Symposium on Memory Systems, 3-6 October 2016. New York, NY, USA: ACM. ISBN 9781450343053, pp. 121-132
  • Conference paper (peer-reviewed), abstract:
    • Modern commercial workloads drive a continuous demand for larger and still low-latency main memories. JEDEC member companies indicate that parallel memory protocols will remain key to such memories, though widening the bus (increasing the pin count) to address larger capacities would cause multiple issues, ultimately reducing the speed (the peak data rate) and cost-efficiency of the protocols. Thus, to stay high-speed and cost-efficient, parallel memory protocols should address larger capacities using the available number of pins. This is accomplished by multiplexing the pins to transfer each address in multiple bus cycles, implementing Multi-Cycle Addressing (MCA). However, additional address-transfer cycles can significantly worsen performance and energy efficiency. This paper contributes the concept of adaptive row addressing, which comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row-address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers. For a case-study MCA protocol, the paper shows that the proposed concept improves: i) the read latency by 7.5% on average and up to 12.5%, and ii) the system-level performance and energy efficiency by 5.5% on average and up to 6.5%. This way, adaptive row addressing makes the MCA protocol as efficient as an idealistic protocol of the same speed but with enough pins to transfer each row address in a single bus cycle.
  •  
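The row-address-caching idea in the abstract above can be sketched as a small LRU structure. The class, capacity, and cycle costs below are illustrative assumptions, not the paper's design: on a hit, only a short cache index needs to cross the bus instead of the full multi-cycle row address.

```python
# Sketch: an LRU cache of recently used row addresses lets a Multi-Cycle
# Addressing (MCA) protocol skip extra address-transfer cycles on a hit.
# Capacity and cycle counts are hypothetical.

from collections import OrderedDict

class RowAddressCache:
    def __init__(self, capacity=8, miss_cycles=2, hit_cycles=1):
        self.entries = OrderedDict()
        self.capacity = capacity
        self.miss_cycles = miss_cycles  # cycles to transfer a full row address
        self.hit_cycles = hit_cycles    # cycles when only a short index is sent

    def access(self, row_addr):
        """Return the address-transfer cycles this row access costs."""
        if row_addr in self.entries:
            self.entries.move_to_end(row_addr)   # refresh LRU position
            return self.hit_cycles
        self.entries[row_addr] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return self.miss_cycles
```

Workloads with row locality then pay the full address-transfer cost only on the first touch of a row, which is the effect the paper's row-address prefetching further amplifies.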
9.
  • Knyaginin, Dmitry, 1983, et al. (author)
  • ProF: Probabilistic Hybrid Main Memory Management for High Performance and Fairness
  • 2016
  • Report (other academic/artistic), abstract:
    • Emerging Non-Volatile Memory (NVM) technologies revolutionize main memory design by enabling hybrid main memory with two partitions: M1 and M2. Such hybrid main memory is built from fast and expensive DRAM (M1) and slower but less expensive NVM (M2), realizing a large, cost-effective, and still high-performance main memory. We consider in this paper a flat, migrating hybrid memory in which hot data blocks are moved from M2 to M1. A challenging issue in managing such a hybrid memory is to achieve both high system-level performance and high fairness among individual programs in a multiprogrammed workload. This paper introduces ProF: Probabilistic hybrid main memory management for high performance and Fairness, a novel approach using Bayes' rule to classify which blocks to migrate to M1. ProF comprises i) a Probabilistic Data migration Mechanism (PDM) that decides which data to move between M1 and M2 to achieve high system performance, and ii) a Slowdown Estimation Mechanism (SEM) that monitors individual program slowdown and guides PDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProF improves fairness by 9% on average and up to 27% compared to the state of the art, while outperforming it by 9% on average and up to 25%.
  •  
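A Bayes-rule hot/cold classification in the spirit of the PDM described above can be sketched as follows. The priors, likelihoods, function names, and threshold are hypothetical placeholders for illustration, not the paper's actual model:

```python
# Sketch: posterior probability that a data block is "hot" given observed
# accesses in an epoch, via Bayes' rule. All probabilities are assumed values.

def p_hot_given_accesses(accesses, p_hot=0.1,
                         p_acc_given_hot=0.8, p_acc_given_cold=0.2):
    """Posterior P(hot | accesses), treating each observed access as
    independent evidence for the block being hot."""
    like_hot = p_acc_given_hot ** accesses
    like_cold = p_acc_given_cold ** accesses
    evidence = like_hot * p_hot + like_cold * (1 - p_hot)
    return like_hot * p_hot / evidence

def should_migrate(accesses, threshold=0.5):
    """Migrate the block from M2 to M1 once the posterior crosses the threshold."""
    return p_hot_given_accesses(accesses) > threshold
```

With these assumed numbers, a block with no observed accesses keeps the 0.1 prior and stays in M2, while three accesses in an epoch push the posterior above the migration threshold.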
10.
  • Knyaginin, Dmitry, 1983, et al. (author)
  • ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness
  • 2018
  • In: Proceedings - International Symposium on High-Performance Computer Architecture. ISSN 1530-0897; February 2018, pp. 143-155
  • Conference paper (peer-reviewed), abstract:
    • Non-Volatile Memory (NVM) technologies enable cost-effective hybrid main memories with two partitions: M1 (DRAM) and slower but larger M2 (NVM). This paper considers a flat, migrating organization of hybrid memories. A challenging and open issue in managing such memories is how to allocate M1 among co-running programs such that high fairness is achieved at the same time as high performance. This paper introduces ProFess: a Probabilistic hybrid main memory management Framework for high performance and fairness. It comprises: i) a Relative-Slowdown Monitor (RSM) that enables fair management by indicating which program suffers the most from the competition for M1; and ii) a probabilistic Migration-Decision Mechanism (MDM) that unlocks high performance by performing a cost-benefit analysis individually for each pair of data blocks considered for migration. Within ProFess, RSM guides MDM towards high fairness. We show that for the multiprogrammed workloads evaluated, ProFess improves fairness by 15% (avg.; up to 29%) compared to the state of the art, while outperforming it by 12% (avg.; up to 29%).
  •  
Type of publication
conference papers (13)
journal articles (7)
reports (3)
Type of content
peer-reviewed (20)
other academic/artistic (3)
Author/editor
Papaefstathiou, Vasi ... (23)
Stenström, Per, 1957 (11)
Sourdis, Ioannis, 19 ... (10)
Manivannan, Madhavan ... (7)
Petersen Moura Tranc ... (6)
Pericas, Miquel, 197 ... (6)
Vasilakis, Evangelos ... (6)
Ejaz, Ahsen, 1986 (4)
Psathakis, A. (3)
Papaefstathiou, Ioan ... (3)
Pnevmatikatos, Dioni ... (3)
Knyaginin, Dmitry, 1 ... (3)
Katevenis, M. (2)
Mavroidis, Iakovos (2)
Alvarez, Lluc (1)
Ruiz, Abraham (1)
Bigas-Soldevilla, Ar ... (1)
Kuroedov, Pavel (1)
Gonzalez, Alberto (1)
Mahale, Hamsika (1)
Bustamante, Noe (1)
Aguilera, Albert (1)
Minervini, Francesco (1)
Salamero, Javier (1)
Palomar, Oscar (1)
Dimou, Nikolaos (1)
Giaourtas, Michalis (1)
Mastorakis, Iasonas (1)
Ieronymakis, Georgio ... (1)
Matzouranis, Georgio ... (1)
Flouris, Vasilis (1)
Kossifidis, Nick (1)
Marazakis, Manolis (1)
Goel, Bhavishya, 198 ... (1)
Strikos, Panagiotis, ... (1)
Vázquez Maceiras, Ma ... (1)
Hagemeyer, Jens (1)
Tigges, L. (1)
Kucza, Nils (1)
Philippe, Jean Marc (1)
Azhar, Muhammad Waqa ... (1)
Goodacre, John (1)
Ioannou, Aggelos D. (1)
Georgopoulos, Konsta ... (1)
Malakonakis, Pavlos (1)
Malek, Alirad, 1983 (1)
Lavagno, Luciano (1)
Nikolopoulos, Dimitr ... (1)
Koch, Dirk (1)
Coppola, Marcello (1)
University
Chalmers tekniska högskola (23)
Language
English (23)
Research subject (UKÄ/SCB)
Natural sciences (17)
Engineering and technology (16)

Year
