SwePub
Search the SwePub database

Result list for the search "WFRF:(Ul Abdin Zain 1975 ) "

  • Results 1-42 of 42
1.
  • Alam, Ashraful, et al. (author)
  • Parallelization of the Estimation Algorithm of the 3D Structure Tensor
  • 2012
  • In: 2012 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2012. - Piscataway, N.J. : IEEE Press. - 9781467329194 - 9781467329217
  • Conference paper (peer-reviewed) abstract
    • The three dimensional structure tensor algorithm (3D-STA) is often used in image processing applications to compute the optical flow or to detect local 3D structures and their directions. This algorithm is computationally expensive due to the many computations required to calculate the gradient and the tensor and to smooth every pixel of the image frames. Therefore, it is important to parallelize the implementation to achieve high performance. In this paper we present two parallel implementations of 3D-STA, namely a moderately parallelized and a highly parallelized implementation, on a massively parallel reconfigurable array. Finally, we evaluate the performance of the generated code, and the results are compared with another optical flow implementation. The throughput achieved by the moderately parallelized implementation is approximately half of the throughput of the optical flow implementation, whereas the highly parallelized implementation results in a 2x gain in throughput as compared to the optical flow implementation. © 2012 IEEE. [See the illustrative sketch after this entry.]
  •  
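    Illustrative sketch (not taken from the paper): the abstract describes computing a gradient, forming a tensor, and smoothing every pixel. The sequential C fragment below shows only the per-pixel tensor step for a small video block; the block dimensions, the central-difference gradient and the deferred smoothing are assumptions made for the example.

        /* Illustrative sketch, not the paper's implementation. */
        #include <stdio.h>

        #define T 3   /* frames */
        #define H 8   /* rows   */
        #define W 8   /* cols   */

        static float img[T][H][W];      /* input video block                  */
        static float J[6][T][H][W];     /* Jxx, Jxy, Jxt, Jyy, Jyt, Jtt       */

        static void structure_tensor(void)
        {
            for (int t = 1; t < T - 1; t++)
                for (int y = 1; y < H - 1; y++)
                    for (int x = 1; x < W - 1; x++) {
                        /* central-difference gradient in x, y and t */
                        float gx = 0.5f * (img[t][y][x + 1] - img[t][y][x - 1]);
                        float gy = 0.5f * (img[t][y + 1][x] - img[t][y - 1][x]);
                        float gt = 0.5f * (img[t + 1][y][x] - img[t - 1][y][x]);

                        /* outer product gives the six unique tensor entries */
                        J[0][t][y][x] = gx * gx;
                        J[1][t][y][x] = gx * gy;
                        J[2][t][y][x] = gx * gt;
                        J[3][t][y][x] = gy * gy;
                        J[4][t][y][x] = gy * gt;
                        J[5][t][y][x] = gt * gt;
                    }
            /* a real implementation would now smooth each component,
               e.g. with a separable box or Gaussian filter */
        }

        int main(void)
        {
            img[1][4][4] = 1.0f;                        /* toy impulse input */
            structure_tensor();
            printf("Jxx next to the impulse: %f\n", J[0][1][4][3]);  /* 0.25 */
            return 0;
        }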
2.
  • Gebrewahid, Essayas, 1984-, et al. (author)
  • Realizing Efficient Execution of Dataflow Actors on Manycores
  • 2014
  • In: 2014 12th IEEE International Conference on Embedded and Ubiquitous Computing (EUC 2014). - Los Alamitos, CA : IEEE Computer Society. ; pp. 321-328
  • Conference paper (peer-reviewed) abstract
    • Embedded DSP computing is currently shifting towards manycore architectures in order to cope with the ever-growing computational demands. Actor-based dataflow languages are being considered as a programming model. In this paper we present a code generator for CAL, one such dataflow language. We propose to use a compilation tool with two intermediate representations. We start from a machine model of the actors that provides an ordering for the testing of conditions and the firing of actions. We then generate an Action Execution Intermediate Representation that is closer to sequential imperative languages such as C and Java. We describe our two intermediate representations and show the feasibility and portability of our approach by compiling a CAL implementation of the Two-Dimensional Inverse Discrete Cosine Transform on a general purpose processor, on the Epiphany manycore architecture and on the Ambric massively parallel processor array. [See the illustrative sketch after this entry.]
  •  
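    Illustrative sketch (an assumption, not the Cal2Many-generated code): a dataflow actor reduced to "test firing conditions in a fixed order, then fire the first enabled action", which is roughly the shape of an action-execution intermediate representation once lowered to sequential C. The FIFO, the two toy actions and the scheduler loop are invented for the example.

        /* Illustrative sketch, not the paper's code generator output. */
        #include <stdio.h>

        #define CAP 8

        typedef struct { int buf[CAP]; int head, tail; } fifo_t;

        static int  fifo_count(const fifo_t *f) { return f->tail - f->head; }
        static void fifo_put(fifo_t *f, int v)  { f->buf[f->tail++ % CAP] = v; }
        static int  fifo_get(fifo_t *f)         { return f->buf[f->head++ % CAP]; }

        /* one toy actor with two actions: pass even tokens, drop odd tokens */
        static fifo_t in_q, out_q;

        static int can_fire_pass(void) { return fifo_count(&in_q) > 0 && in_q.buf[in_q.head % CAP] % 2 == 0; }
        static int can_fire_drop(void) { return fifo_count(&in_q) > 0 && in_q.buf[in_q.head % CAP] % 2 != 0; }

        static void fire_pass(void) { fifo_put(&out_q, fifo_get(&in_q)); }
        static void fire_drop(void) { (void)fifo_get(&in_q); }

        /* scheduler: test conditions in a fixed order, fire the first enabled action */
        static int step(void)
        {
            if (can_fire_pass()) { fire_pass(); return 1; }
            if (can_fire_drop()) { fire_drop(); return 1; }
            return 0;                        /* no action enabled */
        }

        int main(void)
        {
            for (int i = 1; i <= 6; i++) fifo_put(&in_q, i);
            while (step()) ;                 /* run the actor to quiescence */
            while (fifo_count(&out_q) > 0) printf("%d ", fifo_get(&out_q));
            printf("\n");                    /* prints: 2 4 6 */
            return 0;
        }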
3.
  • Gebrewahid, Essayas, 1984-, et al. (author)
  • Actor Fission Transformations for Executing Dataflow Programs on Manycores
  • 2017
  • In: 2017 Forum on Specification and Design Languages (FDL). - 9781538647332 - 9781538611524
  • Conference paper (other academic/artistic) abstract
    • Manycore architectures are dominating the development of advanced embedded computing due to the computational and power demands of high performance applications. This has introduced an additional complexity with regard to the efficient exploitation of the underlying hardware and the development of efficient parallel implementations. To tackle this we model applications using a dataflow programming language, perform high-level transformations of dataflow actors, and generate native code by using our compilation framework. This paper presents the actor fission transformations of our Cal2Many compilation framework. The transformations have facilitated the mapping of big dataflow actors on memory-restricted embedded manycores, increased the utilization of the hardware, and enabled support for task- and data-level parallelism. We have applied the actor transformations to two blocks of the MPEG-4 decoder and executed them on the Epiphany manycore architecture. The results show the practicality and feasibility of our approach.
  •  
4.
  • Gebrewahid, Essayas, 1984-, et al. (author)
  • Cal2Many : A Framework to Compile Dataflow Programs for Manycores
  • 2017
  • Other publication (other academic/artistic) abstract
    • The arrival of manycore platforms has imposed programming challenges for mainstream embedded system developers. In this paper, we discuss the significance of actor-oriented dataflow languages and present our compilation framework for CAL Actor Language that leads to increased portability and retargetability. We demonstrate the applicability of our approach with streaming applications targeting the Epiphany many-core architecture. We have performed an in-depth analysis of MPEG-4 SP implemented on Epiphany using our framework and studied the effects of actor composition. We have identified hardware aspects such as increased off-chip memory bandwidth and larger local memories that could result in further performance improvements.
  •  
5.
  • Gebrewahid, Essayas, 1984-, et al. (author)
  • Programming Real-time Image Processing for Manycores in a High-level Language
  • 2013
  • In: Advanced Parallel Processing Technology. - Berlin Heidelberg : Springer Berlin/Heidelberg. - 9783642452925 ; pp. 381-395
  • Conference paper (peer-reviewed) abstract
    • Manycore architectures are gaining attention as a means to meet the performance and power demands of high-performance embedded systems. However, their widespread adoption is sometimes constrained by the need for mastering proprietary programming languages that are low-level and hinder portability. We propose the use of the concurrent programming language occam-pi as a high-level language for programming an emerging class of manycore architectures. We show how to map occam-pi programs to the manycore architecture Platform 2012 (P2012). We describe the techniques used to translate the salient features of the language to the native programming model of the P2012. We present the results from a case study on a representative algorithm in the domain of real-time image processing: a complex algorithm for corner detection called Features from Accelerated Segment Test (FAST). Our results show that the occam-pi program is much shorter, is easier to adapt and has a competitive performance when compared to versions programmed in the native programming model of P2012 and in OpenCL.
  •  
6.
  • Gebrewahid, Essayas, 1984-, et al. (author)
  • Support for Data Parallelism in the CAL Actor Language
  • 2016
  • In: WPMVP '16. - New York, NY : ACM Press. - 9781450340601
  • Conference paper (peer-reviewed) abstract
    • With the arrival of heterogeneous manycores comprising various features to support task, data and instruction-level parallelism, developing applications that take full advantage of the hardware parallel features has become a major challenge. In this paper, we present an extension to our CAL compilation framework (CAL2Many) that supports data parallelism in the CAL Actor Language. Our compilation framework makes it possible to program architectures with SIMD support using a high-level language and provides efficient code generation. We support general SIMD instructions, but the code generation backend is currently implemented for two custom architectures, namely ePUMA and EIT. Our experiments were carried out for two custom SIMD processor architectures using two applications. The experiments show the possibility of achieving performance comparable to hand-written machine code with much less programming effort.
  •  
7.
  • Gebrewahid, Essayas, 1984- (author)
  • Tools to Compile Dataflow Programs for Manycores
  • 2017
  • Doctoral thesis (other academic/artistic) abstract
    • The arrival of manycore systems enforces new approaches for developing applications in order to exploit the available hardware resources. Developing applications for manycores requires programmers to partition the application into subtasks, consider the dependence between the subtasks, understand the underlying hardware and select an appropriate programming model. This is complex, time-consuming and prone to error. In this thesis, we identify and implement abstraction layers in compilation tools to decrease the burden of the programmer, increase program portability and scalability, and increase retargetability of the compilation framework. We present compilation frameworks for two concurrent programming languages, occam-pi and CAL Actor Language, and demonstrate the applicability of the approach with application case-studies targeting the following manycore architectures: STHorm, Epiphany, Ambric, EIT, and ePUMA. For occam-pi, we have extended the Tock compiler and added a backend for STHorm. We evaluate the approach using a fault tolerance model for a four stage 1D-DCT algorithm implemented by using occam-pi's constructs for dynamic reconfiguration, and the FAST corner detection algorithm, which demonstrates the suitability of occam-pi and the compilation framework for data-intensive applications. For CAL, we have developed a new compilation framework, namely Cal2Many. The Cal2Many framework has a front end, two intermediate representations and four backends: for a uniprocessor, Epiphany, Ambric, and a backend for SIMD based architectures. Also, we have identified and implemented CAL actor fusion and fission methodologies for efficient mapping of CAL applications. We have used QRD, FAST corner detection, 2D-IDCT, and MPEG applications to evaluate our compilation process and to analyze the limitations of the hardware.
  •  
8.
  • Olofsson, Andreas, et al. (author)
  • Kickstarting High-performance Energy-efficient Manycore Architectures with Epiphany
  • 2014
  • In: Conference record. - Piscataway, NJ : IEEE Press. - 9781479982950 - 9781479982974 ; pp. 1719-1726
  • Conference paper (peer-reviewed) abstract
    • In this paper we introduce Epiphany as a high-performance energy-efficient manycore architecture suitable for real-time embedded systems. This scalable architecture supports floating point operations in hardware and achieves 50 GFLOPS/W in 28 nm technology, making it suitable for high performance streaming applications like radio base stations and radar signal processing. Through an efficient 2D mesh Network-on-Chip and a distributed shared memory model, the architecture is scalable to thousands of cores on a single chip. An Epiphany-based open source computer named Parallella was launched in 2012 through Kickstarter crowd funding and has now shipped to thousands of customers around the world. ©2014 IEEE.
  •  
9.
  • Rezk, Nesma, 1987-, et al. (author)
  • Efficient Implementation of Convolution Neural Networks Inference On Manycore Architectures
  • 2017
  • Conference paper (peer-reviewed) abstract
    • The convolution module of convolution neural networks is highly computationally demanding. In order to execute a neural network inference on embedded platforms, an efficient implementation of the convolution is required. Low precision parameters can provide an implementation that requires less memory, less computation time, and less power consumption. Furthermore, streaming the convolution computation over parallelized processing units saves a lot of memory, which is a key concern in memory-constrained embedded platforms. In this paper, we show how the convolution module can be implemented on the Epiphany manycore architecture. Low precision parameters are used with ternary weights of +1, 0, and -1 values. The computation is done through a pipeline by streaming data through processing units. The proposed approach decreases the memory requirements for the CNN implementation and could reach up to 282 GOPS and up to 5.6 GOPS/watt. [See the illustrative sketch after this entry.]
  •  
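    Illustrative sketch (my own toy, not the paper's Epiphany pipeline): a 1-D convolution with ternary weights {-1, 0, +1}, showing how every multiplication degenerates into an add, a subtract or a skip. The kernel, sizes and data are assumptions.

        /* Illustrative sketch, not the paper's implementation. */
        #include <stdio.h>

        #define N 10
        #define K 3

        static const signed char w[K] = { +1, 0, -1 };   /* ternary kernel */

        static void conv1d_ternary(const int *x, int *y, int n)
        {
            for (int i = 0; i + K <= n; i++) {
                int acc = 0;
                for (int k = 0; k < K; k++) {
                    if (w[k] > 0)      acc += x[i + k];   /* weight +1 */
                    else if (w[k] < 0) acc -= x[i + k];   /* weight -1 */
                    /* weight 0: no operation, no product to store */
                }
                y[i] = acc;
            }
        }

        int main(void)
        {
            int x[N] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
            int y[N - K + 1];
            conv1d_ternary(x, y, N);
            for (int i = 0; i < N - K + 1; i++) printf("%d ", y[i]);  /* -2 -2 ... */
            printf("\n");
            return 0;
        }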
10.
  • Rezk, Nesma, 1987- (author)
  • Exploring Efficient Implementations of Deep Learning Applications on Embedded Platforms
  • 2020
  • Licentiate thesis (other academic/artistic) abstract
    • The promising results of deep learning (deep neural network) models in many applications such as speech recognition and computer vision have created a need for their realization on embedded platforms. Augmenting DL (Deep Learning) in embedded platforms grants them support for intelligent tasks in smart homes, mobile phones, and healthcare applications. Deep learning models rely on intensive operations between high precision values. In contrast, embedded platforms have restricted compute and energy budgets. Thus, it is challenging to realize deep learning models on embedded platforms. In this thesis, we define the objectives of implementing deep learning models on embedded platforms. The main objective is to achieve efficient implementations. The implementation should achieve high throughput, preserve low power consumption, and meet real-time requirements. The secondary objective is flexibility. It is not enough to propose an efficient hardware solution for one model. The proposed solution should be flexible to support changes in the model and the application constraints. Thus, the overarching goal of the thesis is to explore flexible methods for efficient realization of deep learning models on embedded platforms. Optimizations are applied to both the DL model and the embedded platform to increase implementation efficiency. To understand the impact of different optimizations, we chose recurrent neural networks (as a class of DL models) and compared their implementations on embedded platforms. The comparison analyzes the optimizations applied and the corresponding performance to provide conclusions on the most fruitful and essential optimizations. We concluded that it is essential to apply an algorithmic optimization to the model to decrease its compute and memory requirements, and it is essential to apply a memory-specific optimization to hide the overhead of memory access to achieve high efficiency. Furthermore, it has been revealed that much of the work under study focuses on implementation efficiency, while flexibility is attempted less often. We have explored the design space of convolutional neural networks (CNNs) on the Epiphany manycore architecture. We adopted a pipeline implementation of CNN that relies on the on-chip memory solely to store the weights. Also, the proposed mapping supported both AlexNet and GoogleNet CNN models, varying precision for weights, and two memory sizes for Epiphany cores. We were able to achieve competitive performance with respect to emerging manycores. As a part of the work in progress, we have studied a DL-architecture co-design approach to increase the flexibility of hardware solutions. A flexible platform should support variations in the model and variations in optimizations. The optimization method should be automated to respond to changes in the model and application constraints with minor effort. In addition, the mapping of the models on embedded platforms should be automated as well.
  •  
11.
  • Rezk, Nesma M., 1987-, et al. (author)
  • Shrink and Eliminate : A Study of Post-Training Quantization and Repeated Operations Elimination in RNN Models
  • 2022
  • In: Information. - Basel : MDPI. - 2078-2489. ; 13:4
  • Journal article (peer-reviewed) abstract
    • Recurrent neural networks (RNNs) are neural networks (NN) designed for time-series applications. There is a growing interest in running RNNs to support these applications on edge devices. However, RNNs have large memory and computational demands that make them challenging to implement on edge devices. Quantization is used to shrink the size and the computational needs of such models by decreasing weights and activation precision. Further, the delta networks method increases the sparsity in activation vectors by relying on the temporal relationship between successive input sequences to eliminate repeated computations and memory accesses. In this paper, we study the effect of quantization on LSTM-, GRU-, LiGRU-, and SRU-based RNN models for speech recognition on the TIMIT dataset. We show how to apply post-training quantization on these models with a minimal increase in the error by skipping quantization of selected paths. In addition, we show that the quantization of activation vectors in RNNs to integer precision leads to considerable sparsity if the delta networks method is applied. Then, we propose a method for increasing the sparsity in the activation vectors while minimizing the error and maximizing the percentage of eliminated computations. The proposed quantization method managed to compress the four models by more than 85%, with an error increase of 0.6, 0, 2.1, and 0.2 percentage points, respectively. By applying the delta networks method to the quantized models, more than 50% of the operations can be eliminated, in most cases with only a minor increase in the error. Comparing the four models to each other under the quantization and delta networks method, we found that compressed LSTM-based models are the best solutions under low-error-rate constraints. The compressed SRU-based models are the smallest in size, suitable when higher error rates are acceptable, and the compressed LiGRU-based models have the highest number of eliminated operations. [See the illustrative sketch after this entry.]
  •  
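    Illustrative sketch (assumptions throughout, not the paper's code): the delta-networks idea on integer-quantized activations. Only elements whose change since the previous time step exceeds a threshold contribute a multiply-accumulate; the rest are skipped, which is where the eliminated operations and memory accesses come from. The vector length, threshold and weights are invented.

        /* Illustrative sketch, not the paper's implementation. */
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>

        #define N 8          /* activation vector length */
        #define THRESH 2     /* delta threshold in quantized units */

        /* one matrix-vector row update driven by activation deltas */
        static void delta_update(const int8_t *x, int8_t *x_prev,
                                 const int8_t *w_row, int32_t *acc)
        {
            for (int i = 0; i < N; i++) {
                int delta = x[i] - x_prev[i];
                if (abs(delta) > THRESH) {    /* significant change: recompute */
                    *acc += (int32_t)w_row[i] * delta;
                    x_prev[i] = x[i];         /* remember the value actually used */
                }                             /* otherwise: operation eliminated */
            }
        }

        int main(void)
        {
            int8_t  w_row[N]  = { 1, -2, 3, -4, 5, -6, 7, -8 };
            int8_t  x_prev[N] = { 0 };
            int32_t acc = 0;

            int8_t x1[N] = { 4, 0, 0, 0, 0, 0, 0, 4 };
            int8_t x2[N] = { 5, 0, 0, 0, 0, 0, 0, -4 };   /* only x[7] changed a lot */

            delta_update(x1, x_prev, w_row, &acc);
            printf("after t1: acc = %d\n", acc);   /* 4*1 + 4*(-8) = -28 */
            delta_update(x2, x_prev, w_row, &acc);
            printf("after t2: acc = %d\n", acc);   /* only x[7] updates: -28 + (-8)*(-8) = 36 */
            return 0;
        }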
12.
  • Rezk, Nesma, 1987-, et al. (author)
  • MOHAQ : Multi-Objective Hardware-Aware Quantization of recurrent neural networks
  • 2022
  • In: Journal of systems architecture. - Amsterdam : Elsevier BV. - 1383-7621 .- 1873-6165. ; 133
  • Journal article (peer-reviewed) abstract
    • The compression of deep learning models is of fundamental importance in deploying such models to edge devices. The selection of compression parameters can be automated to meet changes in the hardware platform and application. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers hardware performance and inference error as objectives for mixed-precision quantization. The proposed method feasibly evaluates candidate solutions in a large search space by relying on two steps. First, post-training quantization is applied for fast solution evaluation (inference-only search). Second, we propose the "beacon-based search" to retrain selected solutions only and use them as beacons to estimate the effect of retraining on other solutions. We use speech recognition models on the TIMIT dataset. Experimental evaluations show that Simple Recurrent Unit (SRU)-based models can be compressed up to 8x by post-training quantization without any significant error increase. On SiLago, we found solutions that achieve 97% and 86% of the maximum possible speedup and energy saving, with a minor increase in error on an SRU-based model. On Bitfusion, the beacon-based search reduced the error gain of the inference-only search on SRU-based models and a Light Gated Recurrent Unit (LiGRU)-based model by up to 4.9 and 3.9 percentage points, respectively.
  •  
13.
  • Rezk, Nesma, 1987-, et al. (author)
  • Recurrent Neural Networks : An Embedded Computing Perspective
  • 2020
  • In: IEEE Access. - Piscataway : IEEE. - 2169-3536. ; 81:1, pp. 57967-57996
  • Journal article (peer-reviewed) abstract
    • Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been a strong interest in executing RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We will define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we compare the defined objectives with the implementations and highlight some open research questions and aspects currently not addressed for embedded RNNs. Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead is vital to obtain high efficiency. To further increase the implementation efficiency, we point out the more promising optimizations that could be applied in future research. Additionally, this article observes that high performance has been targeted by many implementations, while flexibility has, as yet, been attempted less often. Thus, the article provides some guidelines for RNN hardware designers to support flexibility in a better manner.
  •  
14.
  • Rezk, Nesma, 1987-, et al. (author)
  • Streaming Tiles : Flexible Implementation of Convolution Neural Networks Inference on Manycore Architectures
  • 2018
  • In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). - Los Alamitos : IEEE Computer Society. - 9781538655559 - 9781538655566 ; pp. 867-876
  • Conference paper (peer-reviewed) abstract
    • Convolution neural networks (CNN) are extensively used for deep learning applications such as image recognition and computer vision. The convolution module of these networks is highly compute-intensive. Having an efficient implementation of the convolution module enables realizing the inference part of the neural network on embedded platforms. Low precision parameters require less memory, less computation time, and less power consumption while achieving high classification accuracy. Furthermore, streaming the data over parallelized processing units saves a considerable amount of memory, which is a key concern in memory constrained embedded platforms. In this paper, we explore the design space for streamed CNN on Epiphany manycore architecture using varying precisions for weights (ranging from binary to 32-bit). Both AlexNet and GoogleNet are explored for two different memory sizes of Epiphany cores. We are able to achieve competitive performance for both Alexnet and GoogleNet with respect to emerging manycores. Furthermore, the effects of different design choices in terms of precision, memory size, and the number of cores are evaluated by applying the proposed method.
  •  
15.
  • Savas, Süleyman, 1986-, et al. (author)
  • A Configurable Two Dimensional Mesh Network-on-Chip Implementation in Chisel
  • 2019
  • Other publication (other academic/artistic) abstract
    • On-chip communication plays a significant role in the performance of manycore architectures. Therefore, they require a proper on-chip communication infrastructure that can scale with the number of cores. As a solution, network-on-chip structures have emerged and are being used. This paper presents a description of a two dimensional mesh network-on-chip router and a network interface, which are implemented in Chisel to be integrated into the rocket chip generator that generates RISC-V (rocket) cores. The router is implemented in VHDL as well, and the two implementations are verified and compared. Hardware resource usage and performance of different sized networks are analyzed. The implementations are synthesized for a Xilinx Ultrascale FPGA via Xilinx tools for the hardware resource usage and clock frequency results. The performance results, including latency and throughput measurements with different traffic patterns, are collected with cycle accurate emulations. The implementations in Chisel and VHDL do not show a significant difference. Chisel requires around 10% fewer lines of code; however, the difference in the synthesis results is negligible. Our latency results are better than those of the majority of the other studies. The other results, such as hardware usage, clock frequency, and throughput, are competitive when compared to the related works. [See the illustrative sketch after this entry.]
  •  
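    Illustrative sketch (assumption: dimension-ordered XY routing, a common policy for 2-D mesh NoCs; the paper's router may use a different policy). The routing decision of one router hop is shown as a plain C function, with the port names and coordinate convention invented for the example.

        /* Illustrative sketch, not the Chisel or VHDL router from the paper. */
        #include <stdio.h>

        typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

        /* first travel in X, then in Y, then eject to the attached core */
        static port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
        {
            if (dst_x > cur_x) return PORT_EAST;
            if (dst_x < cur_x) return PORT_WEST;
            if (dst_y > cur_y) return PORT_NORTH;
            if (dst_y < cur_y) return PORT_SOUTH;
            return PORT_LOCAL;               /* arrived */
        }

        int main(void)
        {
            static const char *name[] = { "local", "east", "west", "north", "south" };
            /* hop-by-hop path from router (0,0) to router (2,1) */
            int x = 0, y = 0;
            for (;;) {
                port_t p = route_xy(x, y, 2, 1);
                printf("(%d,%d) -> %s\n", x, y, name[p]);
                if (p == PORT_LOCAL) break;
                if (p == PORT_EAST)       x++;
                else if (p == PORT_WEST)  x--;
                else if (p == PORT_NORTH) y++;
                else                      y--;
            }
            return 0;
        }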
16.
  • Savas, Süleyman, 1986-, et al. (author)
  • A framework to generate domain-specific manycore architectures from dataflow programs
  • 2020
  • In: Microprocessors and microsystems. - Amsterdam : Elsevier. - 0141-9331 .- 1872-9436. ; 72
  • Journal article (peer-reviewed) abstract
    • In the last 15 years we have seen, as a response to power and thermal limits for current chip technologies, an explosion in the use of multiple and even many computer cores on a single chip. But now, to further improve performance and energy efficiency, when there are potentially hundreds of computing cores on a chip, we see a need for a specialization of individual cores and the development of heterogeneous manycore computer architectures. However, developing such heterogeneous architectures is a significant challenge. Therefore, we propose a design method to generate domain specific manycore architectures based on the RISC-V instruction set architecture and automate the main steps of this method with software tools. The design method allows generation of manycore architectures with different configurations including core augmentation through instruction extensions and custom accelerators. The method starts from developing applications in a high-level dataflow language and ends by generating synthesizable Verilog code and a cycle-accurate emulator for the generated architecture. We evaluate the design method and the software tools by generating several architectures specialized for two different applications and measure their performance and hardware resource usage. Our results show that the design method can be used to generate specialized manycore architectures targeting applications from different domains. The specialized architectures show at least 3 to 4 times better performance than their general purpose counterparts. In certain cases, replacing general purpose components with specialized components saves hardware resources. Automating the method increases the speed of architecture development and facilitates the design space exploration of manycore architectures.
  •  
17.
  • Savas, Süleyman, 1986-, et al. (author)
  • An Evaluation of Code Generation of Dataflow Languages on Manycore Architectures
  • 2014
  • In: RTCSA 2014. - Piscataway, NJ : IEEE Press.
  • Conference paper (peer-reviewed) abstract
    • Today, computer architectures are shifting from single core to manycores due to several reasons, such as performance demands and power and heat limitations. However, shifting to manycores results in additional complexities, especially with regard to efficient development of applications. Hence there is a need to raise the abstraction level of development techniques for the manycores while exposing the inherent parallelism in the applications. One promising class of programming languages is dataflow languages, and in this paper we evaluate and optimize the code generation for one such language, CAL. We have also developed a communication library to support the inter-core communication. The code generation can target multiple architectures, but the results presented in this paper are focused on Adapteva's manycore architecture Epiphany. We use the two-dimensional inverse discrete cosine transform (2D-IDCT) as our benchmark and compare our code generation from CAL with a hand-written implementation developed in C. Several optimizations in the code generation as well as in the communication library are described, and we have observed that the most critical optimization is reducing the number of external memory accesses. Combining all optimizations we have been able to reduce the difference in execution time between auto-generated and hand-written implementations from a factor of 4.3x down to a factor of only 1.3x. ©2014 IEEE.
  •  
18.
  • Savas, Süleyman, 1986-, et al. (author)
  • Dataflow Implementation of QR Decomposition on a Manycore
  • 2016
  • In: MES '16. - New York, NY : ACM Press. - 9781450342629 ; pp. 26-30
  • Conference paper (peer-reviewed) abstract
    • While parallel computer architectures have become mainstream, application development on them is still challenging. There is a need for new tools, languages and programming models. Additionally, there is a lack of knowledge about the performance of parallel approaches of basic but important operations, such as the QR decomposition of a matrix, on current commercial manycore architectures. This paper evaluates a high level dataflow language (CAL), a source-to-source compiler (Cal2Many) and three QR decomposition algorithms (Givens Rotations, Householder and Gram-Schmidt). The algorithms are implemented both in CAL and hand-optimized C languages, executed on Adapteva's Epiphany manycore architecture and evaluated with respect to performance, scalability and development effort. The performance of the CAL (generated C) implementations comes within 2% of the hand-written versions. They require an average of 25% fewer lines of source code without significantly increasing the binary size. Development effort is reduced and debugging is significantly simplified. The implementations executed on Epiphany cores outperform the GNU scientific library on the host ARM processor of the Parallella board by up to 30x. © 2016 Copyright held by the owner/author(s). [See the illustrative sketch after this entry.]
  •  
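    Illustrative sketch (not the CAL or Epiphany code from the paper): QR decomposition of a small matrix by Givens rotations, one of the three algorithms the study compares. Each rotation zeroes one subdiagonal entry; the matrix and its size are invented for the example.

        /* Illustrative sketch, not the paper's implementation. Link with -lm. */
        #include <stdio.h>
        #include <math.h>

        #define N 3

        /* rotation coefficients that zero b when applied to the pair (a, b) */
        static void givens(double a, double b, double *c, double *s)
        {
            double r = hypot(a, b);
            if (r == 0.0) { *c = 1.0; *s = 0.0; }
            else          { *c = a / r; *s = -b / r; }
        }

        /* overwrite A with R; rotations are applied from the left, column by column */
        static void qr_givens(double A[N][N])
        {
            for (int j = 0; j < N; j++)
                for (int i = N - 1; i > j; i--) {
                    double c, s;
                    givens(A[i - 1][j], A[i][j], &c, &s);
                    for (int k = j; k < N; k++) {
                        double t1 = A[i - 1][k], t2 = A[i][k];
                        A[i - 1][k] = c * t1 - s * t2;
                        A[i][k]     = s * t1 + c * t2;
                    }
                }
        }

        int main(void)
        {
            double A[N][N] = { { 6, 5, 0 }, { 5, 1, 4 }, { 0, 4, 3 } };
            qr_givens(A);
            for (int i = 0; i < N; i++)
                printf("% .4f % .4f % .4f\n", A[i][0], A[i][1], A[i][2]);
            return 0;   /* prints an upper-triangular R, subdiagonal entries ~0 */
        }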
19.
  • Savas, Süleyman, 1986-, et al. (author)
  • Designing Domain-Specific Heterogeneous Architectures from Dataflow Programs
  • 2018
  • In: Computers. - Basel : MDPI AG. - 2073-431X. ; 7:2
  • Journal article (peer-reviewed) abstract
    • The last ten years have seen performance and power requirements pushing computer architectures using only a single core towards so-called manycore systems with hundreds of cores on a single chip. To further increase performance and energy efficiency, we are now seeing the development of heterogeneous architectures with specialized and accelerated cores. However, designing these heterogeneous systems is a challenging task due to their inherent complexity. We proposed an approach for designing domain-specific heterogeneous architectures based on instruction augmentation through the integration of hardware accelerators into simple cores. These hardware accelerators were determined based on their common use among applications within a certain domain. The objective was to generate heterogeneous architectures by integrating many of these accelerated cores and connecting them with a network-on-chip. The proposed approach aimed to ease the design of heterogeneous manycore architectures—and, consequently, exploration of the design space—by automating the design steps. To evaluate our approach, we enhanced our software tool chain with a tool that can generate accelerated cores from dataflow programs. This new tool chain was evaluated with the aid of two use cases: radar signal processing and mobile baseband processing. We could achieve an approximately 4x improvement in performance, while executing complete applications on the augmented cores with a small impact (2.5–13%) on area usage. The generated accelerators are competitive, achieving more than 90% of the performance of hand-written implementations.
  •  
20.
  • Savas, Süleyman, 1986-, et al. (author)
  • Designing Domain Specific Heterogeneous Manycore Architectures Based on Building Blocks
  • 2018
  • Other publication (other academic/artistic) abstract
    • Performance and power requirements have pushed computer architectures from single core to manycores. These requirements now continue pushing manycores with identical cores (homogeneous) towards manycores with specialized cores (heterogeneous). However, designing heterogeneous manycores is a challenging task due to the complexity of the architectures. We propose an approach for designing domain specific heterogeneous manycore architectures based on building blocks. These blocks are defined as the common computations of the applications within a domain. The objective is to generate heterogeneous architectures by integrating many of these blocks into many simple cores and connecting the cores with a network-on-chip. The proposed approach aims to ease the design of heterogeneous manycore architectures and facilitate usage of the dark silicon concept. As a case study, we develop an accelerator based on several building blocks, integrate it into a RISC core and synthesize it on a Xilinx Ultrascale FPGA. The results show that executing a hot-spot of an application on an accelerator based on building blocks increases the performance by 15x, with room for further improvement. The area usage increases as well; however, there are potential optimizations to reduce the area usage. © 2018 by the authors
  •  
21.
  • Savas, Süleyman, 1986-, et al. (author)
  • Efficient Single-Precision Floating-Point Division Using Harmonized Parabolic Synthesis
  • 2017
  • In: 2017 IEEE Computer Society Annual Symposium on VLSI. - Los Alamitos : IEEE. - 9781509067626 - 9781509067633
  • Conference paper (peer-reviewed) abstract
    • This paper proposes a novel method for performing division on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is based on an inverter, implemented as a combination of Parabolic Synthesis and second-degree interpolation, followed by a multiplier. It is implemented with and without pipeline stages individually and synthesized while targeting a Xilinx Ultrascale FPGA. The implementations show better resource usage and latency results when compared to other implementations based on different methods. In terms of throughput, the proposed method outperforms most of the other works; however, some Altera FPGAs achieve a higher clock rate due to differences in the DSP slice multiplier design. Due to the small size, low latency and high throughput, the presented floating-point division unit is suitable for high performance embedded systems and can be integrated into accelerators or be used as a stand-alone accelerator. [See the illustrative sketch after this entry.]
  •  
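    Illustrative sketch: the paper's divider computes a/b as a multiplied by an approximation of 1/b, where the inverter is built from Parabolic Synthesis and second-degree interpolation. That inverter is not reproduced here; the sketch below swaps in a rough bit-manipulation seed refined by Newton-Raphson iterations, purely to illustrate the same invert-then-multiply structure in software. The seed constant is an assumption.

        /* Illustrative sketch only; not the Harmonized Parabolic Synthesis inverter. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        /* coarse initial guess of 1/b via exponent manipulation of the binary32 bits
           (a "magic constant" style seed chosen for this example; an assumption) */
        static float recip_seed(float b)
        {
            uint32_t u;
            memcpy(&u, &b, sizeof u);
            u = 0x7EEF0000u - u;
            float r;
            memcpy(&r, &u, sizeof r);
            return r;
        }

        static float fdiv_approx(float a, float b)
        {
            float r = recip_seed(b);
            r = r * (2.0f - b * r);       /* Newton-Raphson refinement #1 */
            r = r * (2.0f - b * r);       /* Newton-Raphson refinement #2 */
            r = r * (2.0f - b * r);       /* Newton-Raphson refinement #3 */
            return a * r;                 /* final multiply, as in an invert-then-multiply datapath */
        }

        int main(void)
        {
            float a = 355.0f, b = 113.0f;
            /* the two values should agree to roughly six decimals */
            printf("approx: %.6f  exact: %.6f\n", fdiv_approx(a, b), a / b);
            return 0;
        }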
22.
  • Savas, Süleyman, 1986- (author)
  • Hardware/Software Co-Design of Heterogeneous Manycore Architectures
  • 2019
  • Doctoral thesis (other academic/artistic) abstract
    • In the era of big data, advanced sensing, and artificial intelligence, the required computation power is provided mostly by multicore and manycore architectures. However, the performance demand keeps growing. Thus the computer architectures need to continue evolving and provide higher performance. The applications, which are executed on the manycore architectures, are divided into several tasks to be mapped on separate cores and executed in parallel. Usually these tasks are not identical and may be executed more efficiently on different types of cores within a heterogeneous architecture. Therefore, we believe that heterogeneous manycores are the next step for computer architectures. However, there is a lack of knowledge on what form of heterogeneity is the best match for a given application or application domain. This knowledge can be acquired through designing these architectures and testing different design configurations. However, designing these architectures is a great challenge. Therefore, there is a need for an automated design method to facilitate the architecture design and design space exploration to gather knowledge on architectures with different configurations. Additionally, it is already difficult to program manycore architectures efficiently, and this difficulty will only increase further with the introduction of heterogeneity due to the increase in the complexity of the architectures, unless this complexity is somehow hidden. There is a need for software development tools to facilitate the software development for these architectures and enable portability of the same software across different manycore platforms. In this thesis, we first address the challenges of the software development for manycore architectures. We evaluate a dataflow language (CAL) and a source-to-source compilation framework (Cal2Many) with several case studies in order to reveal their impact on productivity and performance of the software. The language supports task level parallelism by adopting the actor model, and the framework takes CAL code and generates implementations in the native language of several different architectures. In order to address the challenge of custom hardware development, we first evaluate a commercial manycore architecture, namely Epiphany, and identify its demerits. Then we study manycore architectures in order to reveal possible uses of heterogeneity in manycores and facilitate choice of architecture for software and hardware development. We define a taxonomy for manycore architectures that is based on the levels of heterogeneity they contain and discuss the benefits and drawbacks of these levels. We finally develop and evaluate a design method to design heterogeneous manycore architectures customized based on application requirements. The architectures designed with this method consist of cores with application specific accelerators. The majority of the design method is automated with software tools, which support different design configurations in order to increase the productivity of the hardware developer and enable design space exploration. Our results show that the dataflow language, together with the software development tool, decreases software development efforts significantly (25-50%), while having a small impact (2-17%) on the performance. The evaluation of the design method reveals that the performance of automatically generated accelerators is between 96-100% of the performance of their manually developed counterparts. Additionally, it is possible to increase the performance of the architectures by increasing the number of cores and using application specific accelerators, usually at a cost in area usage. However, under certain circumstances, using an accelerator may lead to avoiding the usage of large general purpose components such as the floating-point unit and therefore improves the area utilization. Eventually, the final impact on the performance and area usage depends on the configuration. When compared to the Epiphany architecture, which is a commercial homogeneous manycore, the generated manycores show competitive results. We can conclude that the automated design method simplifies heterogeneous manycore architecture design and facilitates design space exploration with the use of configurable parameters.
  •  
23.
  • Savas, Süleyman, 1986-, et al. (author)
  • Using Harmonized Parabolic Synthesis to Implement a Single-Precision Floating-Point Square Root Unit
  • 2019
  • In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). - : IEEE conference proceedings. - 9781728133911 - 9781728133928 ; pp. 621-626
  • Conference paper (peer-reviewed) abstract
    • This paper proposes a novel method for performing the square root operation on floating-point numbers represented in IEEE-754 single-precision (binary32) format. The method is implemented using Harmonized Parabolic Synthesis. It is implemented with and without pipeline stages individually and synthesized for two different Xilinx FPGA boards. The implementations show better resource usage and latency results when compared to other similar works, including the Xilinx intellectual property (IP) that uses the CORDIC method. Any method calculating the square root will make approximation errors. Unless these errors are distributed evenly around zero, they can accumulate and give a biased result. An attractive feature of the proposed method is the fact that it distributes the errors evenly around zero, in contrast to CORDIC for instance. Due to the small size, low latency, high throughput, and good error properties, the presented floating-point square root unit is suitable for high performance embedded systems. It can be integrated into a processor’s floating-point unit or be used as a stand-alone accelerator. © 2019 IEEE.
  •  
24.
  • Savas, Süleyman, 1986- (author)
  • Utilizing Heterogeneity in Manycore Architectures for Streaming Applications
  • 2017
  • Licentiate thesis (other academic/artistic) abstract
    • In the last decade, we have seen a transition from single-core to manycore in computer architectures due to performance requirements and limitations in power consumption and heat dissipation. The first manycores had homogeneous architectures consisting of a few identical cores. However, the applications, which are executed on these architectures, usually consist of several tasks requiring different hardware resources to be executed efficiently. Therefore, we believe that utilizing heterogeneity in manycores will increase the efficiency of the architectures in terms of performance and power consumption. However, development of heterogeneous architectures is more challenging, and the transition from homogeneous to heterogeneous architectures will increase the difficulty of efficient software development due to the increased complexity of the architecture. In order to increase the efficiency of hardware and software development, new hardware design methods and software development tools are required. Additionally, there is a lack of knowledge on the performance of applications when executed on manycore architectures. The transition began with a shift from single-core architectures to homogeneous multicore architectures consisting of a few identical cores. It now continues with a shift from homogeneous architectures with identical cores to heterogeneous architectures with different types of cores specialized for different purposes. However, this transition has increased the complexity of architectures and hence the complexity of software development and execution. In order to decrease the complexity of software development, new software tools are required. Additionally, there is a lack of knowledge on what kind of heterogeneous manycore design is most efficient for different applications and what the performances of these applications are when executed on current commercial manycores. This thesis studies manycore architectures in order to reveal possible uses of heterogeneity in manycores and facilitate choice of architecture for software and hardware developers. It defines a taxonomy for manycore architectures that is based on the levels of heterogeneity they contain and discusses benefits and drawbacks of these levels. Additionally, it evaluates several applications, a dataflow language (CAL), a source-to-source compilation framework (Cal2Many), and a commercial manycore architecture (Epiphany). The compilation framework takes implementations written in the dataflow language as input and generates code targeting different manycore platforms. Based on these evaluations, the thesis identifies the bottlenecks of the architecture. It finally presents a methodology for developing heterogeneous manycore architectures which target specific application domains. Our studies show that using different types of cores in manycore architectures has the potential to increase the performance of streaming applications. If we add specialized hardware blocks to a core, the performance easily increases by 15x for the target application, while the core size increases by 40-50%, which can be optimized further. Other results prove that dataflow languages, together with software development tools, decrease software development efforts significantly (25-50%) while having a small impact (2-17%) on the performance.
  •  
25.
  • Svensson, Bertil, 1948-, et al. (author)
  • A running leap for embedded signal processing to future parallel platforms
  • 2014
  • In: WISE 2014 - Proceedings of the 2014 ACM International Workshop on Long-Term Industrial Collaboration on Software Engineering, Co-located with ASE 2014. - New York, NY, USA : Association for Computing Machinery, Inc. - 9781450330459 ; pp. 35-42
  • Conference paper (peer-reviewed) abstract
    • This paper highlights the collaboration between industry and academia in research. It describes more than two decades of intensive development and research of new hardware and software platforms to support innovative, high-performance sensor systems with extremely high demands on embedded signal processing capability. The joint research can be seen as the run before a necessary jump to a new kind of computational platform based on parallelism. The collaboration has had several phases, starting with a focus on hardware, then on efficiency, later on software development, and finally on taking the jump and understanding the expected future. In the first part of the paper, these phases and their respective challenges and results are described. Then, in the second part, we reflect upon the motivation for collaboration between company and university, the roles of the partners, the experiences gained and the long-term effects on both sides.
  •  
26.
  • Tahir, Madiha, et al. (author)
  • Enhancing the HEVC Video Analyzer for Medical Diagnostic Videos
  • 2015
  • In: 2015 12th International Conference on High-capacity Optical Networks and Enabling/Emerging Technologies (HONET). - [S.l.] : IEEE. - 9781467392686 - 9781467392679 ; pp. 65-69
  • Conference paper (peer-reviewed) abstract
    • Video analyzers are employed to perform an in-depth analysis of coding decisions undertaken during the execution of a video codec. Medical diagnostic videos, which are typically dealt with in telemedicine scenarios, need careful examination to incorporate the most appropriate coding decisions. This paper deals with the enhancement of an open-source video stream analyzer to facilitate codec development tailored for medical diagnostic videos. The proposed extensions include visual representation of quantitative information for the bit count used at the CTU level, as well as displaying the different mode decisions adopted in the case of merge mode, prediction mode, and intra mode. We have incorporated the said extensions in the HEVC analyzer and validated the approach by using test video sequences for Ultrasound, Eye, and Skin examination. © 2015 IEEE.
  •  
27.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • A Radar Signal Processing Case Study for Dataflow Programming of Manycores
  • 2017
  • In: Journal of Signal Processing Systems. - New York : Springer Science and Business Media LLC. - 1939-8018 .- 1939-8115. ; 87:1, pp. 49-62
  • Journal article (peer-reviewed) abstract
    • The successful realization of next generation radar systems places high performance demands on the signal processing chain. Among these are advanced Active Electronically Scanned Array (AESA) radars in which complex calculations are to be performed on huge sets of data in real-time. Manycore architectures are designed to provide flexibility and high performance essential for such streaming applications. This paper deals with the implementation of compute-intensive parts of an AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our findings in terms of performance and productivity gains achieved in this case study. The comparison of the performance results with the reference sequential implementations executing on a state-of-the-art embedded processor shows that we are able to achieve a speedup of 1.6x to 4.4x by using only 10 cores of Epiphany.
  •  
28.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • A Retargetable Compilation Framework for Heterogeneous Reconfigurable Computing
  • 2016
  • In: ACM Transactions on Reconfigurable Technology and Systems. - New York, NY : ACM Special Interest Group on Computer Science Education. - 1936-7406 .- 1936-7414. ; 9:4
  • Journal article (peer-reviewed) abstract
    • The future trend in microprocessors for the more advanced embedded systems is focusing on massively parallel reconfigurable architectures, consisting of heterogeneous ensembles of hundreds of processing elements communicating over a reconfigurable interconnection network. However, the mastering of low-level micro-architectural details involved in programming of such massively parallel platforms becomes too cumbersome, which limits their adoption in many applications. Thus there is a dire need for an approach to produce high-performance scalable implementations that harness the computational resources of the emerging reconfigurable platforms. This paper addresses the grand challenge of accessibility of these diverse reconfigurable platforms by suggesting the use of a high-level language, occam-pi, and developing a complete design flow for building, compiling, and generating machine code for heterogeneous coarse-grained hardware. We have evaluated the approach by implementing complex industrial case studies and three common signal processing algorithms. The results of the implemented case-studies suggest that the occam-pi language based approach, because of its well-defined semantics for expressing concurrency and reconfigurability, simplifies the development of applications employing run-time reconfigurable devices. The associated compiler framework ensures portability as well as the performance benefits across heterogeneous platforms.
  •  
29.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • A Study of Design Efficiency with a High-Level Language for FPGAs
  • 2007
  • In: Proceedings of the 14th International Reconfigurable Architectures Workshop (RAW'07). - Piscataway, N.J. : IEEE. - 1424409101 ; pp. 1-7
  • Conference paper (peer-reviewed) abstract
    • Over the years reconfigurable computing devices such as FPGAs have evolved from gate-level glue logic to complex reprogrammable processing architectures. However, the tools used for mapping computations to such architectures still require knowledge about the architectural details of the target device to extract efficiency. A study of the Mobius language and tools is presented in this paper, with a focus on generated hardware performance. A number of streaming and memory-intensive applications have been developed and the results have been compared with the corresponding implementations in VHDL and a behavioral hardware description language. Based upon experimental evidence, it is concluded that Mobius, a minimal parallel processing language targeted for reconfigurable architectures, enhances productivity in terms of design time and code maintainability without considerably compromising performance and resources.
  •  
30.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • An Evaluation of High-Performance Embedded Processing on MPPAs
  • 2013
  • In: Proceedings. - Los Alamitos, California : IEEE Computer Society. - 9780769549699 - 9781467360050 ; p. 235
  • Conference paper (peer-reviewed) abstract
    • Embedded signal processing is facing the challenges of increased performance as well as energy efficiency. Massively parallel processor arrays (MPPAs) consisting of tens or hundreds of processing cores offer the possibility of meeting the growing performance demand in an energy efficient way by exploiting parallelism instead of scaling the clock frequency of a single powerful processor. In this paper, we evaluate two variants of MPPAs by implementing a significantly large case study, namely an autofocus criterion calculation, which is a key component in modern synthetic aperture radar systems. The implementation results from the two target architectures are compared on the basis of utilized resources, performance, and energy efficiency. The Ambric implementations demonstrate the usefulness of the occam-pi based high-level language approach in utilizing hundreds of processors, whereas the Epiphany implementation reveals that energy efficiency can be improved even further, by a factor of 2-3 with respect to the Ambric implementations, and can be achieved at high clock speeds. © 2013 IEEE.
  •  
31.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • Dataflow Programming of Real-time Radar Signal Processing on Manycores
  • 2014
  • In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP). - Piscataway, NJ : IEEE Press. - 9781479970889 - 9781479970896 - 9781479970872 ; pp. 15-19
  • Conference paper (peer-reviewed) abstract
    • Real-time performance is critical for the successful realization of next generation radar systems. Among these are advanced Active Electronically Scanned Array (AESA) radars in which complex calculations are to be performed on huge sets of data in real-time. Manycore architectures are designed to provide flexibility and high performance essential for such streaming applications. This paper deals with the implementation of compute-intensive parts of an AESA radar signal processing chain in a high-level dataflow language, CAL. We evaluate the approach by targeting a commercial manycore architecture, Epiphany, and present our preliminary findings in terms of performance and productivity gains achieved in this case study. © 2014 IEEE.
  •  
32.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • Energy-Efficient Synthetic-Aperture Radar Processing on a Manycore Architecture
  • 2013
  • In: Proceedings. - Piscataway, NJ : IEEE conference proceedings. - 9780769551173 - 9781479914487 ; pp. 330-338
  • Conference paper (peer-reviewed) abstract
    • The next generation radar systems have high performance demands on the signal processing chain. Examples include the advanced image creating sensor systems in which complex calculations are to be performed on huge sets of data in real-time. Manycore architectures are gaining attention as a means to overcome the computational requirements of the complex radar signal processing by exploiting the massive parallelism inherent in the algorithms in an energy efficient manner. In this paper, we evaluate a manycore architecture, namely a 16-core Epiphany processor, by implementing two significantly large case studies, viz. an autofocus criterion calculation and the fast factorized back-projection algorithm, both key components in modern synthetic aperture radar systems. The implementation results from the two case studies are compared on the basis of achieved performance and programmability. One of the Epiphany implementations demonstrates the usefulness of the architecture for the streaming based algorithm (the autofocus criterion calculation) by achieving a speedup of 8.9x over a sequential implementation on a state-of-the-art general-purpose processor of a later silicon technology generation and operating at a 2.7x higher clock speed. On the other case study, a highly memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a speedup of 4.25x. For embedded signal processing, low power dissipation is equally important as computational performance. In our case studies, the Epiphany implementations of the two algorithms are, respectively, 78x and 38x more energy efficient. © 2013 IEEE
  •  
33.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • Evaluating Video Codecs for Telemedicine Under Very-Low Bitrates
  • 2015
  • In: 2015 8th International Congress on Image and Signal Processing (CISP). - Piscataway, NJ : IEEE. - 9781467390989 - 9781467390972 - 9781467390996 ; pp. 98-103
  • Conference paper (peer-reviewed) abstract
    • Telemedicine is drawing greater attention as a means to improve health care delivery. Video coding, being an integral part of any real-time telemedicine system, is used to deliver the diagnostic video stream to a remote physician. Realizing a video coding system customized for telemedicine using the available technologies poses several challenges. In this paper, we have analyzed state-of-the-art video codecs for adoption in a telemedicine-customized video conferencing system under low-bandwidth and low-computation scenarios. The experimental results for the selected video codec implementations for medical videos reveal that the HEVC encoder achieves equivalent objective video quality when using approximately 60% of the bit rate on average. However, the gain in coding efficiency is at the expense of increased computational complexity, which could be dealt with by incorporating adaptive interpolation and selective quality enhancement techniques to achieve real-time performance. © 2015 IEEE.
  •  
34.
  • Ul-Abdin, Zain, 1975-, et al. (author)
  • Managing Dynamic Reconfiguration for Fault-tolerance on a Manycore Architecture
  • 2012
  • In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012. - New York, USA : IEEE Computer Society. ; pp. 312-319
  • Conference paper (peer-reviewed) abstract
    • With the advent of manycore architectures comprising hundreds of processing elements, fault management has become a major challenge. We present an approach that uses the occam-pi language to manage the fault recovery mechanism on a new manycore architecture, the Platform 2012 (P2012). The approach is made possible by extending our previously developed compiler framework to compile occam-pi implementations to the P2012 architecture. We describe the techniques used to translate the salient features of the occam-pi language to the native programming model of the P2012 architecture. We demonstrate the applicability of the approach with an experimental case study in which the DCT algorithm is implemented on a set of four processing elements. During run-time, some of the tasks are then relocated from assumed-faulty processing elements to faultless ones by means of dynamic reconfiguration of the hardware. The working of the demonstrator and the simulation results illustrate not only the feasibility of the approach but also how the use of higher-level abstractions simplifies fault handling. © 2012 IEEE. (A conceptual sketch of such task relocation follows this entry.)
  •  
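As a purely conceptual illustration of the relocation step described above, the C sketch below remaps tasks away from a processing element that has been marked faulty. It is a minimal, hypothetical example: the names (task_map, pe_faulty, NUM_PE) are invented for illustration, and the paper itself obtains this behavior through occam-pi constructs and dynamic reconfiguration of the P2012 hardware, not through host code like this.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PE    4   /* four processing elements, as in the DCT case study */
    #define NUM_TASKS 4

    static int  task_map[NUM_TASKS] = { 0, 1, 2, 3 };   /* task id -> PE id */
    static bool pe_faulty[NUM_PE]   = { false, false, false, false };

    /* Move every task currently mapped to a faulty PE onto the first healthy PE. */
    static void relocate_from_faulty(void)
    {
        for (int t = 0; t < NUM_TASKS; ++t) {
            if (!pe_faulty[task_map[t]])
                continue;
            for (int pe = 0; pe < NUM_PE; ++pe) {
                if (!pe_faulty[pe]) {
                    printf("task %d relocated: PE %d -> PE %d\n", t, task_map[t], pe);
                    task_map[t] = pe;
                    break;
                }
            }
        }
    }

    int main(void)
    {
        pe_faulty[2] = true;          /* assume PE 2 has been detected as faulty */
        relocate_from_faulty();
        return 0;
    }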
35.
  • Ul-Abdin, Zain, 1975-, et al. (författare)
  • Occam-pi as a High-level Language for Coarse-Grained Reconfigurable Architectures
  • 2011
  • Ingår i: IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. - Washington, USA : IEEE Computer Society. - 9781612844251 - 9780769543857 ; , s. 236-243
  • Konferensbidrag (refereegranskat)abstract
    • Recently we proposed occam-pi as a high-level language for programming coarse-grained reconfigurable architectures. The constructs of occam-pi combine ideas from CSP and the pi-calculus to facilitate expressing parallelism, communication, and reconfigurability. The feasibility of this approach was illustrated by developing a compiler framework to compile occam-pi implementations to the Ambric architecture. In this paper, we demonstrate the applicability of occam-pi for programming an array of functional units, the eXtreme Processing Platform (XPP). This is made possible by extending the compiler framework to target the XPP architecture, including automatic floating- to fixed-point conversion. Different implementations of an FIR filter and a DCT algorithm were developed and evaluated on the basis of performance and resource consumption. The reported results reveal that the approach of using occam-pi to program this category of coarse-grained reconfigurable architectures appears promising. The resulting implementations are generally much superior to those programmed in C and comparable to those hand-coded in the low-level native language NML. (A plain-C reference of the FIR computation follows this entry.)
  •  
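For readers unfamiliar with the benchmark, the FIR filter evaluated above is, at its core, the convolution sketched below. This is only a plain-C statement of the direct-form algorithm; the paper's implementations are written in occam-pi and NML and run on XPP after automatic floating- to fixed-point conversion.

    /* Direct-form FIR filter: y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k]. */
    void fir(const float *x, float *y, int n, const float *h, int taps)
    {
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < taps && k <= i; ++k)   /* skip samples before x[0] */
                acc += h[k] * x[i - k];
            y[i] = acc;
        }
    }

On a coarse-grained reconfigurable array such as XPP, the same computation is typically laid out as a pipeline of multiply-accumulate units rather than as nested loops.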
36.
  • Ul-Abdin, Zain, 1975-, et al. (författare)
  • Occam-pi for Programming of Massively Parallel Reconfigurable Architectures
  • 2012
  • Ingår i: International Journal of Reconfigurable Computing. - New York : Hindawi Publishing Corporation. - 1687-7195 .- 1687-7209. ; 2012
  • Tidskriftsartikel (refereegranskat)abstract
    • Massively parallel reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention as a way to meet the increased computational demands of high-performance embedded systems. We propose the occam-pi language for programming this category of massively parallel reconfigurable architectures. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions of occam-pi and a backend was developed to target the Ambric array of processors. We present two case studies: a DCT implementation exploiting the reconfigurability feature of occam-pi, and a significantly large autofocus criterion calculation based on the dynamic parallelism capability of the language. The results of the implemented case studies suggest that the occam-pi-language-based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits. Copyright © 2012 Zain-ul-Abdin and Bertil Svensson. (The DCT-II definition underlying such transform kernels is given after this entry.)
  •  
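Several entries in this list (this article, the XPP paper above, and the 2D-IDCT benchmark further down) use the DCT as a kernel. For reference, the standard orthonormal 1-D DCT-II that underlies such block transforms is

    \[
      X_k = c_k \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\Bigl(n+\tfrac{1}{2}\Bigr)k\right],
      \qquad c_0 = \sqrt{\tfrac{1}{N}}, \quad c_k = \sqrt{\tfrac{2}{N}} \ \text{for } k > 0,
    \]

and a 2-D (inverse) DCT on an N x N block is obtained separably by applying the 1-D transform to all rows and then to all columns. Which exact normalization the cited implementations use is not stated in the abstracts, so the formula should be read as the textbook definition rather than as a description of their code.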
37.
  • Ul-Abdin, Zain, 1975- (författare)
  • Programming of coarse-grained reconfigurable architectures
  • 2011
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Coarse-grained reconfigurable architectures, which offer massive parallelism coupled with the capability of undergoing run-time reconfiguration, are gaining attention not only in order to meet the increased computational demands of high-performance embedded systems but also to fulfill the need for adaptability to the functional requirements of the application. This thesis focuses on the programming aspects of such coarse-grained reconfigurable computing devices, including the relevant computation models that are capable of exposing different kinds of parallelism inherent in the application and the ability of these models to capture the adaptability requirements of the application. The thesis suggests the occam-pi language for programming a broad class of coarse-grained reconfigurable architectures as an intermediate language; we call it intermediate, since we believe that the application programming is best done in a high-level domain-specific language. The salient properties of the occam-pi language are explicit concurrency with built-in mechanisms for interprocessor communication, provision for expressing dynamic parallelism, support for the expression of dynamic reconfigurations, and placement attributes. To evaluate the programming approach, a compiler framework was extended to support the language extensions in the occam-pi language, and backends were developed to target two different coarse-grained reconfigurable architectures: XPP and Ambric. The results on XPP reveal that the occam-pi-based implementations produce throughput comparable to that of NML programs, while programming at a much higher level of abstraction than that of NML. Similarly, the two occam-pi implementations of the autofocus criterion calculation targeted to the Ambric platform outperform the CPU implementation by factors of 11-23. Thus, the results of the implemented case studies suggest that the occam-pi-language-based approach simplifies the development of applications employing run-time reconfigurable devices without compromising the performance benefits.
  •  
38.
  • Ul-Abdin, Zain, 1975-, et al. (författare)
  • Real-time Radar Signal Processing on Massively Parallel Processor Arrays
  • 2013
  • Ingår i: Conference Record of The Forty-Seventh Asilomar Conference on Signals, Systems & Computers. - Piscataway, NJ : IEEE Signal Processing Society. - 9781479923908 - 9781479923885 ; , s. 1810-1814
  • Konferensbidrag (refereegranskat)abstract
    • Next-generation radar systems place high performance demands on the signal processing chain. Among these are advanced image-creating sensor systems in which complex calculations are to be performed on huge sets of data in real time. Massively Parallel Processor Arrays (MPPAs) are gaining attention as a way to cope with the computational requirements of complex radar signal processing by exploiting the massive parallelism inherent in the algorithms in an energy-efficient manner. In this paper, we evaluate two such massively parallel architectures, namely Ambric and Epiphany, by implementing a significantly large case study, the autofocus criterion calculation, which is a key component in future synthetic aperture radar systems. The two implementations are compared on the basis of achieved performance, energy efficiency, and programmability. © 2013 IEEE.
  •  
39.
  • Ul-Abdin, Zain, 1975-, et al. (författare)
  • Synthetic-Aperture Radar Processing on a Manycore Architecture
  • 2012
  • Konferensbidrag (refereegranskat)abstract
    • Synthetic-Aperture Radar (SAR) systems that are used to create high-resolution radar images from low-resolution aperture data require high computational performance. Manycore architectures are emerging to meet the computational requirements of complex radar signal processing. In this paper, we evaluate a manycore architecture, namely Epiphany, by implementing two significantly large case studies, fast factorized back-projection and autofocus criterion calculation, which are key components in modern synthetic-aperture radar systems. The implementation results from the two case studies are compared on the basis of utilized resources and performance. The Epiphany implementations demonstrate the usefulness of the architecture for the streaming algorithm (autofocus criterion calculation) by achieving a speedup of 8.9x with respect to a sequential implementation on an Intel Core i7 processor while operating at a lower clock speed. For the memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a moderate speedup of about 4.25x. The Epiphany implementations of the two algorithms are, respectively, 38x and 78x more energy-efficient.
  •  
40.
  • Ul-Abdin, Zain, 1975-, et al. (författare)
  • Towards Teaching Embedded Parallel Computing : An Analytical Approach
  • 2015
  • Ingår i: Workshop on Computer Architecture Education, WCAE 2015. - New York, NY, USA : ACM. - 9781450337175
  • Konferensbidrag (refereegranskat)abstract
    • Embedded electronic systems are finding increased application in our daily life. To meet the application demands in embedded systems, parallel computing is used. This paper emphasizes the teaching of the specific issues of parallel computing that are critical to embedded systems. We propose an analytical approach to deliver declarative and functioning knowledge for learning in the field of computer science and engineering, with a special focus on Embedded Parallel Computing (EPC). We describe the teaching of a course focused on how parallel computing can be used to enhance performance and improve energy efficiency of embedded systems. The teaching methods include interactive lectures with web-based course literature, seminars, lab exercises, and home-assigned practical tasks. Further, the course is intended to give a general insight into current research and development regarding parallel architectures and computation models. Since the course is at advanced level, the students are expected to have basic knowledge of the fundamentals of computer architecture and common programming methodologies. The course puts emphasis on hands-on experience with embedded parallel computing. Therefore, it includes an extensive laboratory and project part, in which a state-of-the-art manycore embedded computing system is used. We believe that undertaking these methods in succession will prepare the students for both research and a professional career. © 2015 ACM.
  •  
41.
  • Xypolitidis, Benard, et al. (författare)
  • Towards Architectural Design Space Exploration for Heterogeneous Manycores
  • 2016
  • Ingår i: Proceedings. - Piscataway, NJ : IEEE Computer Society. - 9781467387750 ; , s. 805-810
  • Konferensbidrag (refereegranskat)abstract
    • Many of today's high-performance embedded processors already contain multiple processor cores, and heterogeneous manycore architectures are being proposed. It is therefore very desirable to have a fast way to explore various heterogeneous architectures through an architectural design space exploration tool, giving the designer the option to explore design alternatives before the physical implementation. In this paper, we have extended Heracles, a design space exploration tool for (homogeneous) manycore architectures, to incorporate different types of processing cores and thus allow us to model heterogeneity. Our tool, called the Heterogeneous Heracles System (HHS), can, besides the already supported MIPS core, also include OpenRISC cores. The new tool retains the possibility available in Heracles to perform register transfer level (RTL) simulations of each explored architecture in Verilog as well as to synthesize it to field-programmable gate arrays (FPGAs). To facilitate the exploration of heterogeneous architectures, we have also extended the graphical user interface (GUI) to support heterogeneity. This GUI provides options to configure the types of cores, core settings, the memory system, and the network topology. Some initial results on FPGA utilization are presented from synthesizing both homogeneous and heterogeneous manycore architectures, as well as some benchmark results from both simulated and synthesized architectures.
  •  
42.
  • Yang, Mingkun, 1990-, et al. (författare)
  • A Communication Library for Mapping Dataflow Applications on Manycore Architectures
  • 2013
  • Ingår i: Proceedings of the 6th Swedish Multicore Computing Workshop. ; , s. 65-68
  • Konferensbidrag (refereegranskat)abstract
    • Dataflow programming is a promising paradigm for high-performance embedded parallel computing. When mapping a dataflow program onto a manycore architecture, a key component is the library used to express the communication between the actors. In this paper we present a dataflow communication library supporting the CAL actor language. A first implementation of the communication library was created for Adapteva’s manycore architecture Epiphany, which contains an on-chip 2-D mesh network. Three different buffering methods, with and without direct memory access (DMA) transfer, have been implemented and evaluated. We have also made a preliminary study of the effect of strategies for mapping the actors onto the cores. The assessment of the library is based on a CAL implementation of a two-dimensional inverse discrete cosine transform (2D-IDCT) and our own CAL-to-C compilation framework. As expected, the results show that the most efficient actor-to-core mapping strategy is to keep communication as close to a nearest-neighbor pattern as possible. Thus, the best way to place a pipelined sequence of computations like our 2D-IDCT is to place the actors onto cores in a serpentine fashion. For this application we found that the simple receiver-side buffer outperforms the more complicated buffering strategies that use DMA transfer. (A conceptual sketch of such a receiver-side buffer follows this entry.)
  •  
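The receiver-side buffer mentioned above can be pictured as a single-producer/single-consumer ring buffer owned by the receiving actor. The C sketch below is only a conceptual illustration under that assumption: the names (channel_t, ch_send, ch_recv, CAP) are invented here and are not the API of the library described in the paper, and Epiphany-specific details such as placing the buffer in the receiver's local memory and inserting the required memory-ordering operations are omitted.

    #include <stdbool.h>

    #define CAP 16                        /* tokens per channel buffer (illustrative) */

    typedef struct {
        int          data[CAP];
        volatile int head;                /* advanced by the consumer (receiver) */
        volatile int tail;                /* advanced by the producer (sender)   */
    } channel_t;

    static bool ch_send(channel_t *ch, int token)    /* producer side */
    {
        int next = (ch->tail + 1) % CAP;
        if (next == ch->head)
            return false;                 /* buffer full: exert back-pressure */
        ch->data[ch->tail] = token;
        ch->tail = next;
        return true;
    }

    static bool ch_recv(channel_t *ch, int *token)   /* consumer side */
    {
        if (ch->head == ch->tail)
            return false;                 /* buffer empty */
        *token = ch->data[ch->head];
        ch->head = (ch->head + 1) % CAP;
        return true;
    }

On Epiphany, remote writes are generally much cheaper than remote reads, which is one reason a sender-pushes/receiver-polls buffer of this kind is attractive for fine-grained token traffic and can outperform DMA-based alternatives, as the paper observes for this application.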