SwePub
Search the SwePub database


Search results for "WFRF:(Själander Magnus 1977)"


  • Results 1-50 of 63
1.
  • Björk, Magnus, 1977, et al. (author)
  • Exposed Datapath for Efficient Computing
  • 2006
  • Report (other academic/artistic), abstract:
    • We introduce FlexCore, which is the first exemplar of a processor based on the FlexSoC processor paradigm. The FlexCore utilizes an exposed datapath for increased performance. Microbenchmarks yield a performance boost of a factor of two over a traditional five-stage pipeline with the same functional units as the FlexCore. We describe our approach to compiling for the FlexCore. A flexible interconnect allows the FlexCore datapath to be dynamically reconfigured as a consequence of code generation. Additionally, specialized functional units may be introduced and utilized within the same architecture and compilation framework. The exposed datapath requires a wide control word. The conducted evaluation of two microbenchmarks confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding, as proposed in the FlexSoC paradigm.
4.
  • Thuresson, Martin, 1977, et al. (author)
  • FlexCore: Utilizing Exposed Datapath Control for Efficient Computing
  • 2009
  • In: Journal of Signal Processing Systems. Springer Science and Business Media LLC. ISSN 1939-8018, e-ISSN 1939-8115. 57:1, pp. 5-19
  • Journal article (peer-reviewed), abstract:
    • We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has exposed datapath control and a flexible interconnect that allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, our synthesized, placed-and-routed FlexCore offers savings in both energy and execution time. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding, as proposed in the FlexSoC framework.
5.
  • Själander, Magnus, 1977, et al. (author)
  • A Flexible Datapath Interconnect for Embedded Applications
  • 2007
  • In: IEEE Computer Society Annual Symposium on VLSI, pp. 15-20
  • Conference paper (peer-reviewed), abstract:
    • We investigate the effects of introducing a flexible interconnect into an exposed datapath. We define an exposed datapath as a traditional GPP datapath that has its normal control removed, leading to the exposure of a wide control word. For an FFT benchmark, the introduction of a flexible interconnect reduces the total execution time by 16%. Compared to a traditional GPP, the execution time for an exposed datapath using a flexible interconnect is 32% shorter, whereas the energy dissipation is 29% lower. Our investigation is based on a cycle-accurate architectural simulator, and figures on delay, power, and area are obtained from placed-and-routed layouts in a commercial 0.13-µm technology. The results from our case studies indicate that by utilizing a flexible interconnect, significant performance gains can be achieved for generic applications.
6.
  • Thuresson, Martin, 1977, et al. (author)
  • A Flexible Code-Compression Scheme using Partitioned Look-Up Tables
  • 2009
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Berlin, Heidelberg: Springer Berlin Heidelberg. ISSN 0302-9743, e-ISSN 1611-3349. ISBN 3540929894. 5409 LNCS, pp. 95-109
  • Conference paper (peer-reviewed), abstract:
    • Wide instruction formats make it possible for the compiler to control microarchitecture resources more precisely, either by enabling more parallelism (VLIW) or by saving power. Unfortunately, wide instructions impose a high pressure on the memory system due to an increased instruction-fetch bandwidth and a larger code working set/footprint. This paper presents a code-compression scheme that allows the compiler to select which subset of a wide instruction set to use in each program phase, at the granularity of basic blocks, based on a profiling methodology. The decompression engine comprises a set of tables that convert a narrow instruction into a wide instruction in a dynamic fashion. The paper also presents a method for how to configure and dimension the decompression engine and how to generate a compressed program with embedded instructions that dynamically manage the tables in the decompression engine. We find that the 77 control bits in the original FlexCore instruction format can be reduced to 32 bits, offering a compression of 58% and a modest performance overhead of less than 1% for management of the decompression tables.
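The table-based decompression idea in the abstract above can be illustrated with a small sketch (a hypothetical Python model, not the authors' hardware): fields of the narrow instruction index partitioned look-up tables, and the concatenated table outputs form the wide control word. All table contents and field widths below are invented for illustration.

```python
# Hypothetical sketch of partitioned look-up-table decompression:
# a narrow instruction is split into fields, each field indexes its
# own table, and the table outputs together form the wide word.

def decompress(narrow, tables, field_widths):
    """Expand a narrow instruction into wide control-word fragments."""
    wide_parts = []
    shift = 0
    for table, width in zip(tables, field_widths):
        index = (narrow >> shift) & ((1 << width) - 1)
        wide_parts.append(table[index])
        shift += width
    return tuple(wide_parts)

# Example: two 4-bit fields of a narrow word select entries from
# two tables holding wide control-word fragments.
tables = [
    {0x3: 0b1010101, 0x1: 0b0000001},  # partition 0: 7 control bits
    {0x2: 0b110011, 0x0: 0b000000},    # partition 1: 6 control bits
]
narrow = (0x2 << 4) | 0x3  # field 0 = 0x3, field 1 = 0x2
wide = decompress(narrow, tables, [4, 4])
print(wide)  # (85, 51), i.e. (0b1010101, 0b110011)
```

In the paper's scheme the tables themselves are managed dynamically by embedded instructions; this sketch only shows the lookup step.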
7.
  • Thuresson, Martin, 1977, et al. (author)
  • A Flexible Code Compression Scheme using Partitioned Look-Up Tables
  • 2008
  • Report (other academic/artistic), abstract:
    • Wide instruction formats make it possible to control microarchitecture resources more finely by enabling more parallelism (VLIW) or by utilizing the microarchitecture more efficiently by exposing the control to the compiler. Unfortunately, wide instructions impose a higher pressure on the memory system due to an increased instruction-fetch bandwidth and a larger code working set/footprint. This paper presents a code compression scheme that allows the compiler to select what subset of the wide instruction set to use in each program phase at the granularity of basic blocks based on a profiling methodology. The decompression engine comprises a set of tables that convert a narrow instruction into a wide instruction in a dynamic fashion. The paper also presents a method for how to configure and dimension the decompression engine and how to generate a compressed program with embedded instructions that dynamically manage the tables in the decompression engine. We find that the 77 control bits in the original FlexCore instruction format can be reduced to 32 bits, offering a compression of 58% and a modest performance overhead of less than 1% for management of the decompression tables.
8.
  • Azhar, Muhammad Waqar, 1986, et al. (author)
  • Viterbi Accelerator for Embedded Processor Datapaths
  • 2012
  • In: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors. ISSN 1063-6862. ISBN 9780769547688. pp. 133-140
  • Conference paper (peer-reviewed), abstract:
    • We present a novel architecture for a lightweight Viterbi accelerator that can be tightly integrated inside an embedded processor. We investigate the accelerator's impact on processor performance by using the EEMBC Viterbi benchmark and the in-house Viterbi Branch Metric kernel. Our evaluation based on the EEMBC benchmark shows that an accelerated 65-nm 2.7-ns processor datapath is 20% larger but 90% more cycle efficient than a datapath lacking the Viterbi accelerator, leading to an 87% overall energy reduction and a data throughput of 3.52 Mbit/s.
9.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Designing a Practical Data Filter Cache to Improve Both Energy Efficiency and Performance
  • 2013
  • In: Transactions on Architecture and Code Optimization. ISSN 1544-3566, e-ISSN 1544-3973. 10:4, 25 pages
  • Journal article (peer-reviewed), abstract:
    • Conventional Data Filter Cache (DFC) designs improve processor energy efficiency but degrade performance. Furthermore, the single-cycle line transfer suggested in prior studies adversely affects Level-1 Data Cache (L1 DC) area and energy efficiency. We propose a practical DFC that is accessed early in the pipeline and transfers a line over multiple cycles. Our DFC design improves performance and eliminates a substantial fraction of L1 DC accesses for loads, L1 DC tag checks on stores, and data translation lookaside buffer accesses for both loads and stores. Our evaluation shows that the proposed DFC can reduce the data access energy by 42.5% and improve execution time by 4.2%.
10.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Improving Data Access Efficiency by Using a Tagless Access Buffer (TAB)
  • 2013
  • In: Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2013. ISBN 9781467355254. pp. 269-279
  • Conference paper (peer-reviewed), abstract:
    • The need for energy efficiency continues to grow for many classes of processors, including those for which performance remains vital. Data cache is crucial for good performance, but it also represents a significant portion of the processor's energy expenditure. We describe the implementation and use of a tagless access buffer (TAB) that greatly improves data access energy efficiency while slightly improving performance. The compiler recognizes memory reference patterns within loops and allocates these references to a TAB. This combined hardware/software approach reduces energy usage by (1) replacing many level-one data cache (L1D) accesses with accesses to the smaller, more power-efficient TAB; (2) removing the need to perform tag checks or data translation lookaside buffer (DTLB) lookups for TAB accesses; and (3) reducing DTLB lookups when transferring data between the L1D and the TAB. Accesses to the TAB occur earlier in the pipeline, and data lines are prefetched from lower memory levels, which results in a small performance improvement. In addition, we can avoid many unnecessary block transfers between other memory hierarchy levels by characterizing how data in the TAB are used. With a combined size equal to that of a conventional 32-entry register file, a four-entry TAB eliminates 40% of L1D accesses and 42% of DTLB accesses, on average. This configuration reduces data-access related energy by 35% while simultaneously decreasing execution time by 3%.
11.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Reconfigurable Instruction Decoding for a Wide-Control-Word Processor
  • 2011
  • In: Proceedings of the Reconfigurable Architectures Workshop (RAW), IEEE International Parallel & Distributed Processing Symposium (IPDPS). ISBN 9780769543857. pp. 322-325
  • Conference paper (peer-reviewed), abstract:
    • Fine-grained control through the use of a wide control word can lead to high instruction-level parallelism, but unless compressed the words require a large memory footprint. A reconfigurable fixed-length decoding scheme can be created by taking advantage of the fact that an application only uses a subset of the datapath for its execution. We present the first complete implementation of the FlexCore processor, integrating a wide-control-word datapath with a run-time reconfigurable instruction decompressor. Our evaluation, using three different EEMBC benchmarks, shows that it is possible to reach up to 35% speedup compared to a five-stage pipelined MIPS processor, assuming the same datapath units. In addition, our VLSI implementations show that this FlexCore processor offers up to 24% higher energy efficiency than the MIPS reference processor.
12.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Speculative Tag Access for Reduced Energy Dissipation in Set-Associative L1 Data Caches
  • 2013
  • In: Proceedings of the IEEE International Conference on Computer Design (ICCD), Asheville, NC, USA, October 6-9, 2013. pp. 302-308
  • Conference paper (peer-reviewed), abstract:
    • Due to performance reasons, all ways in set-associative level-one (L1) data caches are accessed in parallel for load operations, even though the requested data can only reside in one of the ways. Thus, a significant amount of energy is wasted when loads are performed. We propose a speculation technique that performs the tag comparison in parallel with the address calculation, leading to the access of only one way during the following cycle on successful speculations. The technique incurs no execution time penalty, has an insignificant area overhead, and does not require any customized SRAM implementation. Assuming a 16 kB 4-way set-associative L1 data cache implemented in a 65-nm process technology, our evaluation based on 20 different MiBench benchmarks shows that the proposed technique on average leads to a 24% data cache energy reduction.
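The speculation idea summarized above, performing the tag comparison in parallel with the address calculation, can be modelled with a small Python sketch. The cache geometry and helper names below are illustrative assumptions, not the paper's implementation: the speculation succeeds when the base address and the final address (base plus offset) fall in the same cache line, so the early tag compare already identifies the single way to read.

```python
# Hypothetical model of speculative tag access in a set-associative
# data cache: tags are compared using the (speculative) base address;
# if adding the offset does not change the line, only the matching
# way needs to be accessed in the following cycle.

LINE = 32        # bytes per line (assumed)
WAYS = 4
SETS = 128       # 16 kB / (32 B * 4 ways)

def split(addr):
    """Return (tag, set index) for a byte address."""
    return addr // (LINE * SETS), (addr // LINE) % SETS

def speculate_way(tags, base, offset):
    """Early tag compare with the base address; returns the single
    way to access on a successful speculation, else None."""
    spec_tag, spec_set = split(base)
    if split(base + offset) != (spec_tag, spec_set):
        return None  # misspeculation: fall back to accessing all ways
    for way in range(WAYS):
        if tags[spec_set][way] == spec_tag:
            return way  # only this way is accessed next cycle
    return None

# A line with tag 1 cached in set 0, way 2:
tags = [[None] * WAYS for _ in range(SETS)]
tags[0][2] = 1
print(speculate_way(tags, 0x1000, 4))  # 2 (same line, one way accessed)
```

A misspeculation (offset crossing the line) simply reverts to the conventional all-ways access, which is why the technique incurs no execution-time penalty.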
13.
  • Bardizbanyan, Alen, 1986, et al. (author)
  • Towards a Performance- and Energy-Efficient Data Filter Cache
  • 2013
  • In: Workshop on Optimizations for DSP and Embedded Systems (ODES), Proceedings of the International Symposium on Code Generation and Optimization (CGO), Shenzhen, China, Feb. 23-27. New York, NY, USA: ACM. ISBN 9781450319058. pp. 21-28
  • Conference paper (peer-reviewed), abstract:
    • As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedded processor's total power dissipation, techniques that decrease L1 DC accesses can significantly enhance processor energy efficiency. Filter caches are known to efficiently decrease the number of accesses to instruction caches. However, due to the irregular access pattern of data accesses, a conventional data filter cache (DFC) has a high miss rate, which degrades processor performance. We propose to integrate a DFC with a fast address calculation technique to significantly reduce the impact of misses and to improve performance by enabling one-cycle loads. Furthermore, we show that DFC stalls can be eliminated even after unsuccessful fast address calculations, by simultaneously accessing the DFC and L1 DC on the following cycle. We quantitatively evaluate different DFC configurations, with and without the fast address calculation technique, using different write allocation policies, and qualitatively describe their impact on energy efficiency. The proposed design provides an efficient DFC that yields both energy and performance improvements.
16.
  • Frolov, Nikita, 1986, et al. (author)
  • A SAT-Based Compiler for FlexCore
  • 2011
  • Report (other academic/artistic), abstract:
    • Much like VLIW, statically scheduled architectures that expose all control signals to the compiler offer much potential for highly parallel, energy-efficient performance. Bau is a novel compilation infrastructure that leverages the LLVM compilation tools and the MiniSAT solver to generate efficient code for one such exposed architecture. We first build a compiler construction library that allows scheduling and resource constraints to be expressed declaratively in a domain-specific language, and then use this library to implement a compiler that generates programs that are 1.2-1.5 times more compact than either a baseline MIPS R2K compiler or a basic-block-based, sequentially phased scheduler.
17.
  • Frolov, Nikita, 1986, et al. (author)
  • Declarative, SAT-solver-based Scheduling for an Embedded Architecture with a Flexible Datapath
  • 2011
  • In: Swedish System-on-Chip Conference.
  • Conference paper (other academic/artistic), abstract:
    • Much like VLIW, statically scheduled architectures that expose all control signals to the compiler offer much potential for highly parallel, energy-efficient performance. Bau is a novel compilation infrastructure that leverages the LLVM compilation tools and the MiniSAT solver to generate efficient code for one such exposed architecture. We first build a compiler construction library that allows scheduling and resource constraints to be expressed declaratively in a domain-specific language, and then use this library to implement a compiler that generates programs that are 1.2-1.5 times more compact than either a baseline MIPS R2K compiler or a basic-block-based, sequentially phased scheduler.
18.
  • Goel, Bhavishya, 1981, et al. (author)
  • Infrastructures for Measuring Power
  • 2011
  • Report (other academic/artistic), abstract:
    • Energy-aware resource management requires some means of measuring power consumption. We present three approaches to measuring processor power. The easiest and least intrusive places a power meter between the system and the power outlet. Unfortunately, this provides a single system-wide measurement, and acuity is limited by the device's sampling frequency. Another method samples power at the PSU voltage outputs using current transducers. This logs consumption separately per component, but requires custom hardware and an expensive analog acquisition device. A more accurate alternative samples power directly at the processor voltage regulator's current-sensing pin, but requires motherboard intrusion. We explain the implementation of each approach step by step.
19.
  • Goel, Bhavishya, 1981, et al. (author)
  • Techniques to Measure, Model, and Manage Power
  • 2012
  • In: Advances in Computers. ISSN 0065-2458. ISBN 9780123965288. 87, pp. 7-54
  • Book chapter (other academic/artistic), abstract:
    • Society's increasing dependence on information technology has resulted in the deployment of vast compute resources. The energy costs of operating these resources coupled with environmental concerns have made energy-aware computing one of the primary challenges for the IT sector. Making energy-efficient computing a rule rather than an exception requires that researchers and system designers use the right set of techniques and tools. These involve measuring, analyzing, and controlling the energy expenditure of computers at varying degrees of granularity. In this chapter, we present techniques to measure power consumption of computer systems at various levels and compare their effectiveness. We discuss methodologies to estimate processor power consumption using performance-counter-based power modeling and show how the power models can be used for power-aware scheduling. Armed with such techniques and methodologies, we as a research and development community can better address challenges in power-aware management.
20.
  • Goumas, Georgios, 1900, et al. (author)
  • Adapt or Become Extinct!
  • 2011
  • In: Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era. New York, NY, USA: ACM. ISBN 9781450307086. pp. 46-51
  • Conference paper (peer-reviewed), abstract:
    • The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, accelerators, etc. This environment also presents a number of hard boundaries (walls) for applications which limit software development (parallel programming wall), performance (memory wall, communication wall) and viability (power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be utilized to adapt the execution context, which may lead to performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information like user input, architectural characteristics such as the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform's performance monitoring units. One of the challenges of future execution platforms is to allow the seamless integration of these various kinds of information with information obtained from static analysis (either ahead-of-time or just-in-time compilation). We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation throughout the life-cycle of an application.
21.
  • Hoang, Tung, 1980, et al. (author)
  • A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit
  • 2010
  • In: IEEE Transactions on Circuits and Systems I: Regular Papers. ISSN 1549-8328, e-ISSN 1558-0806. 57:12, pp. 3073-3081
  • Journal article (peer-reviewed), abstract:
    • We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers, and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-extension solution, implements all other functionality. Place-and-route evaluations using a 65-nm 1.1-V cell library show that the proposed architecture offers a 31% improvement in speed and a 32% reduction in energy per operation, averaged across operand sizes of 16, 32, 48, and 64 bits, over a reference two-cycle MAC architecture that employs a multiplier in the first stage and an accumulator in the second. When operating the proposed architecture at the lower frequency of the reference architecture, the available timing slack can be used to downsize gates, resulting in a 52% reduction in energy compared to the reference. We extend the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands. In comparison to a fixed-function 32-bit MAC unit, 16-bit multiply-accumulate operations can be executed with 67% higher energy efficiency on a 32-bit DTMAC unit.
22.
  • Hoang, Tung, 1980, et al. (author)
  • Design Space Exploration for an Embedded Processor with Flexible Datapath Interconnect
  • 2010
  • In: Proceedings of the IEEE Int. Conf. on Application-specific Systems, Architectures and Processors (ASAP). ISSN 1063-6862. ISBN 9781424469673. pp. 55-62
  • Conference paper (peer-reviewed), abstract:
    • The design of an embedded processor is dependent on the application domain. Traditionally, design solutions specific to an application domain have been available in three forms: VLIW-based DSP processors, ASICs and FPGAs; each respectively offering generality of application domain, energy efficiency and flexibility. However, while matching the application domain to the resources needed, the design space becomes huge. We present FlexTools, a tool framework built around the FlexCore architecture to evaluate performance and energy efficiency for different applications. Here we demonstrate FlexTools for design space exploration with a focus on the data-routing flexibility of the FlexCore processor, in search of energy-efficient interconnect configurations that are both cycle-count and hardware efficient. Evaluation results suggest that a well-optimized instance of a 65-nm multiplier-extended FlexCore processor datapath, obtained using FlexTools, executes nine integer EEMBC benchmarks with a 15% cycle count reduction and dissipates 17% less energy than a reference MIPS datapath.
24.
  • Hoang, Tung, 1980, et al. (author)
  • Double Throughput Multiply-Accumulate Unit for FlexCore Processor Enhancements
  • 2009
  • In: 23rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2009; Rome; Italy; 23 May 2009 through 29 May 2009. ISBN 9781424437504
  • Conference paper (peer-reviewed), abstract:
    • As a simple five-stage General-Purpose Processor (GPP), the baseline FlexCore processor has a limited set of datapath units. By utilizing a flexible datapath interconnect and a wide control word, a FlexCore processor is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. In this paper, we propose the integration of a novel Double Throughput Multiply-Accumulate (DTMAC) unit, whose different operating modes allow for on-the-fly optimization of computational precision. For the two EEMBC benchmarks considered, the FlexCore processor performance is significantly enhanced when one DTMAC accelerator is included, translating into reduced execution time and energy dissipation. In comparison to the 32-bit GPP reference, the accelerated 32-bit FlexCore processor shows a 4.37x improvement in execution time and a 3.92x reduction in energy dissipation, for a benchmark with many consecutive 16-bit MAC operations.
30.
  • Moreau, Daniel, 1990, et al. (author)
  • CREEP: Chalmers RTL-based Energy Evaluation of Pipelines
  • 2017
  • Report (other academic/artistic), abstract:
    • Energy estimation at the architectural level is vital, since early design decisions have the greatest impact on the final implementation of an electronic system. It is, however, a particular challenge to perform energy evaluations for processors: while the software presents the processor designer with methodological problems related to, e.g., the choice of benchmarks, technology scaling has made implementation properties depend strongly on, e.g., different circuit optimizations such as those used during timing closure. However tempting it is to modularize the hardware, this common method of using decoupled pipeline building blocks for energy estimation is bound to neglect implementation and integration aspects that are increasingly important. We introduce CREEP, an energy-evaluation framework for processor pipelines, which at its core has an accurate 65-nm CMOS implementation model of different configurations of a MIPS-I-like pipeline including level-1 caches. While CREEP by default uses already existing estimated post-layout data, it is also possible for an advanced user to modify the pipeline RTL code or retarget the RTL code to a different process technology. We describe the CREEP evaluation flow, the components and tools used, and demonstrate the framework by analyzing a few different processor configurations in terms of energy and performance.
31.
  • Moreau, Daniel, 1990, et al. (author)
  • Practical Way Halting by Speculatively Accessing Halt Tags
  • 2016
  • In: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016, Dresden, Germany, 14-18 March 2016. Singapore: Research Publishing Services. ISSN 1530-1591. ISBN 9783981537062. pp. 1375-1380
  • Conference paper (peer-reviewed), abstract:
    • Conventional set-associative data cache accesses waste energy, since the tag and data arrays of several ways are simultaneously accessed to sustain pipeline speed. Different access techniques to avoid activating all cache ways have been previously proposed in an effort to reduce energy usage. However, a problem that many of these access techniques have in common is that they need to access different cache memory portions in a sequential manner, which is difficult to support with standard synchronous SRAM memory. We propose the speculative halt-tag access (SHA) approach, which accesses the low-order tag bits, i.e., the halt tag, in the address generation stage instead of the SRAM access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it on average reduces data access energy by 25.6%.
32.
  • Nishtala, Rajiv, et al. (author)
  • Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services
  • 2020
  • Conference paper (peer-reviewed), abstract:
    • Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets, resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS)-aware task manager for latency-critical services colocated on a server system. Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to 99% QoS guarantee for latency-critical services.
33.
  • Reissmann, Nico, et al. (author)
  • RVSDG: An Intermediate Representation for Optimizing Compilers
  • 2020
  • In: ACM Transactions on Embedded Computing Systems. Association for Computing Machinery (ACM). ISSN 1539-9087, e-ISSN 1558-3465. 19:6
  • Journal article (peer-reviewed), abstract:
    • Intermediate Representations (IRs) are central to optimizing compilers, as the way the program is represented may enhance or limit analyses and transformations. Suitable IRs focus on exposing the most relevant information and establish invariants that different compiler passes can rely on. While control-flow-centric IRs appear to be a natural fit for imperative programming languages, analyses required by compilers have increasingly shifted to understanding data dependencies and working at multiple abstraction layers at the same time. This is partially evidenced in recent developments such as the Multi-Level Intermediate Representation (MLIR) proposed by Google. However, rigorous use of data-flow-centric IRs in general-purpose compilers has not been evaluated for feasibility and usability, as previous works provide no practical implementations. We present the Regionalized Value State Dependence Graph (RVSDG) IR for optimizing compilers. The RVSDG is a data-flow-centric IR where nodes represent computations, edges represent computational dependencies, and regions capture the hierarchical structure of programs. It represents programs in demand-dependence form, implicitly supports structured control flow, and models entire programs within a single IR. We provide a complete specification of the RVSDG, construction and destruction methods, as well as exemplify its utility by presenting Dead Node and Common Node Elimination optimizations. We implemented a prototype compiler and evaluate it in terms of performance, code size, compilation time, and representational overhead. Our results indicate that the RVSDG can serve as a competitive IR in optimizing compilers while reducing complexity.
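The core RVSDG notions named in the abstract above (nodes as computations, edges as value dependencies, regions as hierarchy) and the Dead Node Elimination pass it mentions can be sketched in a few lines. This is a hypothetical toy model for illustration, not the authors' implementation; all class and function names are invented.

```python
# Toy model of a demand-dependence graph: a node is live only if a
# region result (transitively) demands its value, so Dead Node
# Elimination is a reachability sweep from the results.

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    inputs: list = field(default_factory=list)  # producer nodes (edges)

@dataclass
class Region:
    nodes: list = field(default_factory=list)
    results: list = field(default_factory=list)  # values leaving the region

def dead_node_elimination(region):
    """Keep only nodes reachable from the region's results."""
    live, stack = set(), list(region.results)
    while stack:
        node = stack.pop()
        if id(node) not in live:
            live.add(id(node))
            stack.extend(node.inputs)
    region.nodes = [n for n in region.nodes if id(n) in live]
    return region

# a = const; b = const; c = a + b (demanded); d = a * a (dead)
a, b = Node("const"), Node("const")
c = Node("add", [a, b])
d = Node("mul", [a, a])
r = dead_node_elimination(Region(nodes=[a, b, c, d], results=[c]))
print([n.op for n in r.nodes])  # ['const', 'const', 'add']
```

Because the IR is demand-driven, "dead" is defined purely by dependency edges; no control-flow analysis is needed for this pass.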
35.
  • Sakalis, Christos, et al. (author)
  • Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks
  • Other publication (other academic/artistic), abstract:
    • MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in a loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack software that is shielded against replay, even under conditions where a side-channel attack would not be possible (e.g., in secure enclaves). At the same time, unlike speculative side-channel attacks, microarchitectural replay attacks can be used to amplify the correct path of execution, rendering many existing speculative side-channel defences ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defence against them. We make the observation that such attacks rely on repeated squashes of so-called "replay handles" and that the instructions causing the side-channel must reside in the same reorder buffer window as the handles. We propose Delay-on-Squash, a hardware-only technique for tracking squashed instructions and preventing them from being replayed by speculative replay handles. Our evaluation shows that it is possible to achieve full security against microarchitectural replay attacks with very modest hardware requirements, while still maintaining 97% of the insecure baseline performance.
36.
  • Sakalis, Christos, et al. (author)
  • Do Not Predict – Recompute! : How Value Recomputation Can Truly Boost the Performance of Invisible Speculation
  • 2021
  • In: 2021 International Symposium on Secure and Private Execution Environment Design (SEED). - Institute of Electrical and Electronics Engineers (IEEE). - 9781665420259, pp. 89-100
  • Conference paper (peer-reviewed) abstract
    • Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the lost performance. The problem, however, cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by providing the value of loads that miss) without incurring its negative side-effect (delaying the release of precious resources), if we can safely, non-speculatively, recompute a value in isolation (without being seen from the outside), so that we do not expose any information by transferring such a value via the memory hierarchy. Value Recomputation, which trades computation for data transfer, was previously proposed in an entirely different context: to reduce energy-expensive data transfers in the memory hierarchy. In this paper, we demonstrate the potential of value recomputation in relation to the Delay-on-Miss approach of hiding speculation, discuss the trade-offs, and show that we can achieve the same level of security, reaching 93% of the unsecured baseline performance (5% higher than Delay-on-Miss), and exceeding (by 3%) what even an oracular (100% accuracy and coverage) value predictor could do.
  •  
37.
  • Sakalis, Christos, et al. (author)
  • Evaluating the Potential Applications of Quaternary Logic for Approximate Computing
  • 2020
  • In: ACM Journal on Emerging Technologies in Computing Systems. - Association for Computing Machinery (ACM). - 1550-4832 .- 1550-4840. ; 16:1
  • Journal article (peer-reviewed) abstract
    • There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today’s complementary metal-oxide-semiconductor technologies. A common feature among the investigated technologies is that of multi-level devices, particularly the possibility of implementing quaternary logic gates and memory cells. However, for such multi-level devices to be used reliably, an increase in energy dissipation and operation time is required. Building on the principle of approximate computing, we present a set of combinational logic circuits and memory based on multi-level logic gates in which we can trade reliability against energy efficiency. Keeping the energy and timing constraints constant, important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We analyze the behavior of the logic circuits when exposed to transient errors caused as a side effect of this encoding. We also evaluate the potential benefit of the logic circuits and memory by embedding them in a conventional computer system on which we execute jpeg, sobel, and blackscholes approximately. We demonstrate that blackscholes is not suitable for such a system and explain why. However, we also achieve dynamic energy reductions of 10% and 13% for jpeg and sobel, respectively, and improve execution time by 38% for sobel, while maintaining adequate output quality.
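The binary-versus-quaternary split described above can be sketched in software (illustrative encoding only; the paper works at the circuit level): error-tolerant data packs two bits per quaternary cell, halving the cell count, while a single-cell upset perturbs the value only by a bounded amount.

```python
# Sketch of the encoding trade-off: two bits per quaternary digit means an
# 8-bit value needs 4 cells instead of 8, at the cost of robustness.

def to_quaternary(value, digits):
    """Encode an integer as base-4 digits, least significant first."""
    return [(value >> (2 * i)) & 0b11 for i in range(digits)]

def from_quaternary(qdigits):
    return sum(d << (2 * i) for i, d in enumerate(qdigits))

pixel = 0b10110100  # an error-tolerant 8-bit value (e.g., image data)
q = to_quaternary(pixel, 4)     # 4 quaternary cells instead of 8 binary
assert from_quaternary(q) == pixel

# A single-cell error in the most significant digit shifts the value by a
# multiple of 64: tolerable for approximate pixel data, fatal for pointers,
# which is why important data stays in the more robust binary format.
q[3] ^= 0b01
print(abs(from_quaternary(q) - pixel))  # 64
```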
  •  
38.
  • Sakalis, Christos, 1990- (author)
  • Rethinking Speculative Execution from a Security Perspective
  • 2021
  • Doctoral thesis (other academic/artistic) abstract
    • Speculative out-of-order execution is one of the fundamental building blocks of modern, high-performance processors. To maximize the utilization of the system's resources, hardware and software security checks in the speculative domain can be temporarily ignored, without affecting the correctness of the application, as long as no architectural changes are made before transitioning to the non-speculative domain. Similarly, the microarchitectural state of the system, which is by necessity modified for every single operation (speculative or otherwise), also does not affect the correctness of the application, as such state is meant to be invisible on the architectural level. Unfortunately, while the microarchitectural state of the system is indeed separate from the architectural state and is typically hidden from the users, it can still be observed indirectly through its side-effects, through the use of "side-channels". Starting with Meltdown and Spectre, speculative execution, combined with existing side-channel attacks, can be abused to bypass both hardware and software security barriers and illegally gain access to data that would not be accessible otherwise. Embroiled in a battle between security and efficiency, computer architects have designed numerous microarchitectural solutions to this issue, all the while new attacks are being constantly discovered. This thesis proposes two such speculative side-channel defenses, Ghost loads and Delay-on-Miss, both of which protect against speculative side-channel attacks targeting the cache and memory hierarchy as their side-channel.
Ghost loads work by making speculative loads invisible in the memory hierarchy, while Delay-on-Miss, which is both simpler and more secure than Ghost loads, restricts speculative loads from even reaching many levels of the hierarchy.At the same time, this thesis also tackles security problems brought on by speculative execution that are not themselves speculative side-channel attacks, namely microarchitectural replay attacks. In the latter, the attacker abuses speculative execution not to gain access to data but to amplify an otherwise already existing side-channel. This is achieved by trapping the execution of a victim application in a repeating window of speculation, forcing it to constantly squash and re-execute the same side-channel instructions again and again. To counter such attacks, Delay-on-Squash is introduced, which prevents instructions from being replayed in the same window of speculation, hence stopping any microarchitectural replay attempts.Overall, between Delay-on-Squash, Delay-on-Miss, and Ghost loads, this thesis covers a wide range of insecure microarchitectural behaviors and secure countermeasures for them, all the while balancing the trade-offs between security, performance, and complexity.
  •  
39.
  • Sakalis, Christos (author)
  • Securing the Memory Hierarchy from Speculative Side-Channel Attack
  • 2020
  • Licentiate thesis (other academic/artistic) abstract
    • Modern high-performance CPUs depend on speculative out-of-order execution in order to offer high performance while also remaining energy efficient. However, with the introduction of Meltdown and Spectre in the beginning of 2018, speculative execution has been under attack. These exploits, and the many that followed, take advantage of the unchecked nature of speculative execution and the microarchitectural changes it causes in order to mount speculative side-channel attacks. Such attacks can bypass software and hardware barriers and gain access to sensitive information while remaining invisible to the application. In this thesis we will describe our work on preventing speculative side-channel attacks that exploit the memory hierarchy as their side-channel. Specifically, we will discuss two different approaches: one where we do not restrict speculative execution but try to keep its microarchitectural side-effects hidden, and one where we delay speculative memory accesses if we determine that they might lead to information leakage. We will discuss the advantages and disadvantages of both approaches, compare them against other state-of-the-art solutions, and show that it is possible to achieve secure, invisible speculation while at the same time maintaining high performance and efficiency.
  •  
40.
  • Sakalis, Christos, et al. (author)
  • Seeds of SEED : Preventing Priority Inversion in Instruction Scheduling to Disrupt Speculative Interference
  • 2021
  • In: 2021 International Symposium on Secure and Private Execution Environment Design (SEED). - Institute of Electrical and Electronics Engineers (IEEE). - 9781665420259, pp. 101-107
  • Conference paper (peer-reviewed) abstract
    • Speculative side-channel attacks consist of two parts: The speculative instructions that abuse speculative execution to gain illegal access to sensitive data and the side-channel instructions that leak the sensitive data. Typically, the side-channel instructions are assumed to follow the speculative instructions and be dependent on them. Speculative side-channel defenses have taken advantage of these facts to construct solutions where speculative execution is limited only under the presence of these conditions, in an effort to limit the performance overheads introduced by the defense mechanisms. Unfortunately, it turns out that only focusing on dependent instructions enables a new set of attacks, referred to as "speculative interference attacks". These are a new variant of speculative side-channel attacks, where the side-channel instructions are placed before the point of misspeculation and hence before any illegal speculative instructions. As this breaks the previous assumptions on how speculative side-channel attacks work, this new attack variant can be used to bypass many of the existing defenses. We argue that the root cause of speculative interference is a priority inversion between the scheduling of older, bound to be committed, and younger, bound to be squashed instructions, which affects the execution order of the former. This priority inversion can be caused by affecting either the readiness of a not-yet-ready older instruction or the issuing priority of an older instruction after it becomes ready. We disrupt the opportunity for speculative interference by ensuring that current defenses adequately prevent the interference of younger instructions with the availability of operands to older instructions and by proposing an instruction scheduling policy to preserve the priority of ready instructions. 
As a proof of concept, we also demonstrate how the prevention of scheduling-priority inversion can safeguard a specific defense, Delay-on-Miss, from the possibility of speculative interference attacks. We first discuss why it is susceptible to interference attacks and how this can be corrected without introducing any additional performance costs or hardware complexity, with simple instruction scheduling rules. 
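The proposed remedy, preserving the issue priority of ready instructions, can be illustrated with a toy age-ordered selection policy (a sketch under our own naming, not the paper's hardware):

```python
# Sketch of an age-ordered issue policy: among ready instructions, always
# issue the oldest, so younger (possibly bound-to-be-squashed) instructions
# cannot invert the scheduling priority of older, bound-to-commit ones.

def select_to_issue(instructions):
    """instructions: list of (age, ready); smaller age = older.
    Returns the age of the instruction to issue, or None."""
    ready = [(age, r) for age, r in instructions if r]
    return min(ready)[0] if ready else None  # oldest ready instruction wins

window = [(10, True), (11, False), (12, True), (13, True)]
print(select_to_issue(window))  # 10: younger ready entries must wait
```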
  •  
41.
  • Sakalis, Christos, et al. (author)
  • Understanding Selective Delay as a Method for Efficient Secure Speculative Execution
  • 2020
  • In: IEEE Transactions on Computers. - 0018-9340 .- 1557-9956. ; 69:11, pp. 1584-1595
  • Journal article (peer-reviewed) abstract
    • Since the introduction of Meltdown and Spectre, the research community has been tirelessly working on speculative side-channel attacks and on how to shield computer systems from them. To ensure that a system is protected not only from all the currently known attacks but also from future, yet to be discovered, attacks, the solutions developed need to be general in nature, covering a wide array of system components, while at the same time keeping the performance, energy, area, and implementation complexity costs at a minimum. One such solution is our own delay-on-miss, which efficiently protects the memory hierarchy by i) selectively delaying speculative load instructions and ii) utilizing value prediction as an invisible form of speculation. In this article we dive deeper into delay-on-miss, offering insights into why and how it affects the performance of the system. We also reevaluate value prediction as an invisible form of speculation. Specifically, we focus on the implications that delaying memory loads has in the memory level parallelism of the system and how this affects the value predictor and the overall performance of the system. We present new, updated results but more importantly, we also offer deeper insight into why delay-on-miss works so well and what this means for the future of secure speculative execution.
  •  
42.
  • Saljooghi, Vahid, et al. (author)
  • Configurable RTL Model for Level-1 Caches
  • 2012
  • In: Proceedings of NORCHIP, Copenhagen, Denmark, Nov. 11-12. - 9781467322218
  • Conference paper (peer-reviewed) abstract
    • Level-1 (L1) cache memories are complex circuits that tightly integrate memory, logic, and state machines near the processor datapath. During the design of a processor-based system, many different cache configurations that vary in, for example, size, associativity, and replacement policies, need to be evaluated in order to maximize performance or power efficiency. Since the implementation of each cache memory is a time-consuming and error-prone process, a configurable and synthesizable model is very useful as it helps to generate a range of caches in a quick and reproducible manner. Comprising both a data and instruction cache, the RTL cache model that we present in this paper has a wide array of configurable parameters. Apart from different cache size parameters, the model also supports different replacement policies, associativities, and data write policies. The model is written in VHDL and fits different processors in ASICs and FPGAs. To show the usefulness of the model, we provide an example of cache configuration exploration.
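The kind of configuration exploration the model enables can be mimicked with a tiny behavioral cache model in software (a sketch, not the paper's VHDL; LRU replacement and the parameter names are assumptions):

```python
# Behavioral model of a set-associative cache: size, line size, and
# associativity are constructor parameters, as in the configurable RTL
# model described above; replacement here is LRU via an ordered dict.

from collections import OrderedDict

class Cache:
    def __init__(self, size, line, ways):
        self.sets = size // (line * ways)
        self.line, self.ways = line, ways
        self.data = [OrderedDict() for _ in range(self.sets)]  # per-set LRU

    def access(self, addr):
        """Return True on hit, False on miss (line is then filled)."""
        idx = (addr // self.line) % self.sets
        tag = addr // self.line // self.sets
        s = self.data[idx]
        hit = tag in s
        if hit:
            s.move_to_end(tag)          # refresh LRU position
        else:
            if len(s) == self.ways:
                s.popitem(last=False)   # evict least recently used line
            s[tag] = True
        return hit

c = Cache(size=1024, line=32, ways=2)   # 1 KiB, 2-way, 32 B lines
print(c.access(0), c.access(0), c.access(4096))  # False True False
```

Sweeping such parameters over an address trace gives the quick, reproducible configuration comparison the abstract motivates.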
  •  
43.
  • Sanchez, Carlos, et al. (author)
  • Redesigning a tagless access buffer to require minimal ISA changes
  • 2016
  • In: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES 2016. - New York, NY, USA : ACM. - 9781450321389 - 9781450344821, Article number 2968504
  • Conference paper (peer-reviewed) abstract
    • Energy efficiency is a first-order design goal for nearly all classes of processors, but it is particularly important in mobile and embedded systems. Data caches in such systems account for a large portion of the processor's energy usage, and thus techniques to improve the energy efficiency of the cache hierarchy are likely to have high impact. Our prior work reduced data cache energy via a tagless access buffer (TAB) that sits at the top of the cache hierarchy. Strided memory references are redirected from the level-one data cache (L1D) to the smaller, more energy-efficient TAB. These references need not access the data translation lookaside buffer (DTLB), and they can avoid unnecessary transfers from lower levels of the memory hierarchy. The original TAB implementation requires changing the immediate field of load and store instructions, necessitating substantial ISA modifications. Here we present a new TAB design that requires minimal instruction set changes, gives software more explicit control over TAB resource management, and remains compatible with legacy (non-TAB) code. With a line size of 32 bytes, a four-line TAB can eliminate 31% of L1D accesses, on average. Together, the new TAB, L1D, and DTLB use 22% less energy than a TAB-less hierarchy, and the TAB system decreases execution time by 1.7%.
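The redirection of strided references can be approximated with a one-line-buffer model (behavioral sketch only; the real TAB holds several lines and cooperates with the compiler and DTLB):

```python
# Sketch of the tagless-access-buffer idea: strided references that fall
# within the line already held by the buffer skip the L1 data cache (and
# the DTLB) entirely; only the first touch of each line needs the L1D.

LINE = 32  # bytes per TAB line, matching the evaluation above

class TabLine:
    def __init__(self):
        self.base = None  # address of the buffered line, if any

    def access(self, addr):
        """Return True if served by the TAB, False if the L1D is needed."""
        line_base = addr - addr % LINE
        if self.base == line_base:
            return True          # tagless hit: no L1D or DTLB access
        self.base = line_base    # fetch the new line (one L1D access)
        return False

tab = TabLine()
hits = sum(tab.access(a) for a in range(0, 256, 4))  # stride-4 array walk
print(hits)  # 56 of 64 references avoid the L1D
```

With a 4-byte stride, seven out of every eight references hit the buffered 32-byte line, which is the flavor of L1D-access elimination the abstract quantifies.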
  •  
44.
  •  
45.
  • Själander, Magnus, 1977, et al. (author)
  • A Look-Ahead Task Management Unit for Embedded Multi-Core Architectures
  • 2008
  • In: 11th EUROMICRO Conference on Digital System Design Architectures, Methods and Tools, DSD 2008; Parma, Italy; 3-5 September 2008. - 9780769532776, pp. 149-157
  • Conference paper (peer-reviewed) abstract
    • Efficient utilization of multi-core architectures relies on the partitioning of applications into tasks and mapping the tasks to cores. In some applications (e.g., H.264 video decoding parallelized at the macro-block level) these tasks have dependencies among each other. Task scheduling, consisting of selecting a task with satisfied dependencies and mapping it to a core, is typically a functionality delegated to the operating system. In this paper we present a hardware Task Management Unit (TMU) that looks ahead in time to find tasks to be executed by a multi-core architecture. The look-ahead functionality is shown to reduce the task management overhead by 40-50% when executing a parallelized version of an H.264 video decoder on an architecture with up to 16 cores. Overall, the TMU-based multi-core architecture reaches a speedup of more than 14x on 16 cores running H.264 video decoding, assuming CABAC is implemented in a dedicated coprocessor.
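The dependence tracking such a TMU performs in hardware corresponds to a standard ready-queue scheme, sketched here in software (illustrative; the paper's unit additionally looks ahead to fill the ready queue before cores request work):

```python
# Sketch of dependency-aware task scheduling: a task becomes ready once all
# of its predecessors have completed; cores (here, a single loop) repeatedly
# take the next ready task, as a hardware TMU would hand them out.

from collections import deque

def schedule(deps):
    """deps: task -> set of tasks it depends on. Returns a valid order."""
    pending = {t: set(d) for t, d in deps.items()}
    ready = deque(t for t, d in pending.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)                  # "execute" the task
        for u, d in pending.items():
            if t in d:
                d.remove(t)
                if not d:
                    ready.append(u)      # dependencies satisfied: ready
    return order

# Macro-block style diamond dependency: A before B and C, D last.
print(schedule({"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}))
```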
  •  
46.
  •  
47.
  •  
48.
  • Själander, Magnus, 1977, et al. (author)
  • An Efficient Twin-Precision Multiplier
  • 2004
  • In: International Conference on Computer Design (ICCD), pp. 30-33
  • Conference paper (peer-reviewed) abstract
    • We present a twin-precision multiplier that in normal operation mode efficiently performs N-b multiplications. For applications where the demand on precision is relaxed, the multiplier can perform N/2-b multiplications while expending only a fraction of the energy of a conventional N-b multiplier. For applications with high demands on throughput, the multiplier is capable of performing two independent N/2-b multiplications in parallel. A comparison between two signed 16-b multipliers, where both perform single 8-b multiplications, shows that the twin-precision multiplier has 72% lower power dissipation and 15% higher speed than the conventional one, while only requiring 8% more transistors.
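The twin-precision mode can be emulated bit by bit (a software sketch of the principle; the hardware gates partial products rather than looping): zeroing the cross partial products makes one N-bit array compute two independent N/2-bit products.

```python
# Bit-level sketch of twin-precision multiplication (unsigned case): an
# N x N partial-product array computes two independent N/2-bit products
# when the cross partial products (low operand half x high half) are gated.

def twin_precision_multiply(a, b, n=16):
    h = n // 2
    result = 0
    for i in range(n):
        for j in range(n):
            same_half = (i < h) == (j < h)  # gate out cross partial products
            if same_half and (a >> i) & 1 and (b >> j) & 1:
                result += 1 << (i + j)      # surviving partial product
    return result

a = (9 << 8) | 13   # two unsigned 8-bit operands packed into 16 bits
b = (5 << 8) | 7
r = twin_precision_multiply(a, b)
print(r & 0xFFFF, r >> 16)  # 91 45  (13*7 in the low half, 9*5 in the high)
```

The low product lands in bits 0-15 and the high product in bits 16-31, so both results are read out of the same datapath, which is what enables the parallel N/2-bit mode described in the abstract.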
  •  
49.
  • Själander, Magnus, 1977, et al. (author)
  • An LTE Uplink Receiver PHY Benchmark and Subframe-Based Power Management
  • 2012
  • In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software.
  • Conference paper (peer-reviewed) abstract
    • With the proliferation of mobile phones and other mobile internet appliances, the application area of baseband processing continues to grow in importance. Much academic research addresses the underlying mathematics, but little has been published on the design of systems to execute baseband workloads. Most systems research is conducted within companies who go to great lengths to protect their intellectual property. We present an open-source LTE Uplink Receiver PHY benchmark with a realistic representation of the baseband processing of an LTE base station, and we demonstrate its usefulness in investigating resource management strategies to conserve power on a TILEPro64. By estimating the workload of each subframe and using these estimates to control power-gating, we reduce power consumption by more than 24% (11% on average) compared to executing the benchmark with no estimation-guided resource management. By making available a benchmark containing no proprietary algorithms, we enable a broader community to conduct research both in baseband processing and on the systems that are used to execute such workloads.
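The estimation-guided power management can be sketched as follows (all numbers illustrative; the benchmark's actual workload estimator and the TILEPro64 gating interface are more involved):

```python
# Sketch of subframe-based power gating: before each subframe, the number
# of active cores is chosen from the estimated workload, and the remaining
# cores are power-gated for the duration of that subframe.

def cores_needed(estimated_cycles, budget_per_core):
    """Smallest core count whose combined budget covers the estimate."""
    return max(1, -(-estimated_cycles // budget_per_core))  # ceil division

TOTAL_CORES = 64                        # e.g., a TILEPro64-class part
subframes = [120_000, 480_000, 60_000]  # per-subframe workload estimates
budget = 100_000                        # cycles one core offers per subframe
for est in subframes:
    active = min(TOTAL_CORES, cores_needed(est, budget))
    gated = TOTAL_CORES - active        # cores power-gated this subframe
    print(active, gated)
```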
  •  
50.
  • Själander, Magnus, 1977 (author)
  • Efficient and Flexible Embedded Systems and Datapath Components
  • 2008
  • Doctoral thesis (other academic/artistic) abstract
    • The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Improving the end-user experience under often stringent requirements, in terms of high performance, low power dissipation, and low cost, makes these systems complex and nontrivial to design. This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor, in combination with a wide control word consisting of all control signals in a FlexCore’s datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which, together with the expressiveness of the wide control word, improves its performance. When designing new embedded systems it is important to have efficient components to build from. Arithmetic circuits are especially important, since they are extensively used in all applications. In particular, integer multipliers present big design challenges. The proposed twin-precision technique makes it possible to improve both the throughput and power of conventional integer multipliers when computing narrow-width multiplications. The thesis also shows that the Baugh-Wooley algorithm is more suitable for hardware implementations of signed integer multipliers than the commonly used modified-Booth algorithm. A multi-core architecture is a common design choice when a single-core architecture cannot deliver sufficient performance. However, multi-core architectures introduce their own design challenges, such as scheduling applications onto several cores.
This thesis presents a novel task management unit, which offloads task scheduling from the conventional cores of a multi-core system, thus improving both the performance and power efficiency of the system. This thesis proposes novel solutions to a number of relevant issues that need to be addressed when designing embedded systems.
  •  