
Hit list for search "WFRF:(Tajammul Muhammad Adeel) "

Search: WFRF:(Tajammul Muhammad Adeel)

Result 1-10 of 11

Enumeration	Reference	Cover	Find
1.	Farahini, Nasim, et al. (author) 39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation 2013 In: 2013 IEEE International Symposium on Circuits and Systems (ISCAS). - : IEEE. - 9781467357609 ; , s. 1448-1451 Conference paper (peer-reviewed) Abstract: This paper presents an industrial case study of using a Coarse Grain Reconfigurable Architecture (CGRA) for a multi-mode accelerator for two kernels: FFT for the LTE standard and the Correlation Pool for the UMTS standard to be executed in a mutually exclusive manner. The CGRA multi-mode accelerator achieved computational efficiency of 39.94 GOPS/watt (OP is multiply-add) and silicon efficiency of 56.20 GOPS/mm2. By analyzing the code and inferring the unused features of the fully programmable solution, an in-house developed tool was used to automatically customize the design to run just the two kernels and the two efficiency metrics improved to 49.05 GOPS/watt and 107.57 GOPS/mm2. Corresponding numbers for the ASIC implementation are 63.84 GOPS/watt and 90.91 GOPS/mm2. Though the ASIC’s silicon and computational efficiency numbers are slightly better, the engineering efficiency of the pre-verified/characterized CGRA solution is at least 10X better than the ASIC solution.
2.	Shami, Muhammad Ali, et al. (author) Configurable FFT Processor Using Dynamically Reconfigurable Resource Arrays 2019 In: Journal of Signal Processing Systems. - : SPRINGER. - 1939-8018 .- 1939-8115. ; 91:5, s. 459-473 Journal article (peer-reviewed) Abstract: This paper presents results of using a Coarse Grain Reconfigurable Architecture called DRRA (Dynamically Reconfigurable Resource Array) for FFT implementations varying in order and degree of parallelism using radix-2 decimation in time (DIT). The DRRA fabric is extended with a memory architecture to be able to deal with data sets much larger than what can be accommodated in the register files of DRRA. The proposed implementation scheme is generic in terms of the number of FFT points, the size of memory and the size of the register file in DRRA. Two implementations (DRRA-1 and DRRA-2) have been synthesized in 65 nm technology and energy/delay numbers measured with post-layout annotated gate level simulations. The results are compared to other Coarse Grain Reconfigurable Architectures (CGRAs), and dedicated FFT processors for 1024 and 2048 point FFT. For 1024 point FFT, in terms of FFT operations per unit energy, DRRA-1 and DRRA-2 outperform all CGRAs by at least 2x and are worse than ASIC by 3.45x. However, in terms of energy-delay product DRRA-2 outperforms CGRAs by at least 1.66x and dedicated FFT processors by at least 10.9x. For 2048-point FFT, DRRA-1 and DRRA-2 are 10x better for energy efficiency and 94.84x better for energy-delay product. However, the radix-2 implementation is worse by 9.64x and 255x in terms of energy efficiency and energy-delay product when compared against a radix-2(4) implementation.
3.	Tajammul, Muhammad Adeel, 1982-, et al. (author) A NoC based distributed memory architecture with programmable and partitionable capabilities 2010 In: 28th Norchip Conference, NORCHIP 2010. - 9781424489732 Conference paper (peer-reviewed) Abstract: The paper focuses on the design of a Network-on-chip based programmable and partitionable distributed memory architecture which can be integrated with a Coarse Grain Reconfigurable Architecture (CGRA). The proposed interconnect enables better interaction between computation fabric and memory fabric. The system can modify its memory to computation element ratio at runtime. The extensive capabilities of the memory system are analyzed by interfacing it with a Dynamically Reconfigurable Resource Array (DRRA), a CGRA. The interconnect can provide multiple interfaces which support up to 8 GB/s per interface.
4.	Tajammul, Muhammad Adeel, 1982-, et al. (author) NoC Based Distributed Partitionable Memory System for a Coarse Grain Reconfigurable Architecture 2011 In: 24th Annual Conference on VLSI Design. - : IEEE Computer Society. - 9780769543482 ; , s. 232-237 Conference paper (peer-reviewed)
5.	Tajammul, Muhammad Adeel, et al. (author) Segmented bus based path setup scheme for a distributed memory architecture 2012 In: Proceedings - IEEE 6th International Symposium on Embedded Multicore SoCs, MCSoC 2012. - : IEEE. ; , s. 67-74 Conference paper (peer-reviewed) Abstract: This paper proposes a composite instruction for path setup and partitioning of a network on chip using segmented buses. The network connects a distributed memory to a coarse grained reconfigurable architecture. The scheme decreases the partitioning and routing instructions in sequencers (S) for the nodes (N) from Nx3 to a single instruction. This reduction in instructions also bears a small performance benefit as fewer instructions are scheduled onto the network. Furthermore, it is possible to optimize the system under application-specific constraints. A simple use-case with experiments is defined to show the design trade-offs for these optimization decisions.
6.	Farahini, Nasim, et al. (author) Parallel distributed scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric 2014 In: Microprocessors and microsystems. - : Elsevier BV. - 0141-9331 .- 1872-9436. ; 38:8, s. 788-802 Journal article (peer-reviewed) Abstract: This paper presents a hardware based solution for a scalable runtime address generation scheme for DSP applications mapped to a parallel distributed coarse grain reconfigurable computation and storage fabric. The scheme can also deal with non-affine functions of multiple variables that typically correspond to multiple nested loops. The key innovation is the judicious use of two categories of address generation resources. The first category of resource is the low cost AGU that generates addresses for given address bounds for affine functions of up to two variables. Such low cost AGUs are distributed and associated with every read/write port in the distributed memory architecture. The second category of resource is relatively more complex; it is also distributed, but shared among a few storage units, and is capable of handling more complex address generation requirements like dynamic computation of address bounds that are then used to configure the AGUs, transformation of non-affine functions to affine functions by computing the affine factor outside the loop, etc. The runtime computation of the address constraints results in negligibly small overhead in latency, area and energy while it provides substantial reduction in program storage, reconfiguration agility and energy compared to the prevalent pre-computation of address constraints. The efficacy of the proposed method has been validated against the prevalent address generation schemes for a set of six realistic DSP functions. Compared to the pre-computation method, the proposed solution achieved 75% average code compaction and compared to the centralized runtime address generation scheme, the proposed solution achieved 32.7% average performance improvement.
7.	Jafri, Syed Mohammad Asad Hassan, et al. (author) Energy-Aware-Task-Parallelism for Efficient Dynamic Voltage, and Frequency Scaling, in CGRAs 2013 In: Proceedings - 2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, IC-SAMOS 2013. - : IEEE. - 9781479901036 ; , s. 104-112 Conference paper (peer-reviewed) Abstract: Today, coarse grained reconfigurable architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Each application itself is composed of multiple tasks, spatially mapped to different parts of the platform. Providing a worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, dynamic voltage and frequency scaling (DVFS) is a frequently used technique. DVFS allows scaling the voltage and/or frequency of the device, based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining dynamic parallelism with DVFS. The proposed methods exploit the speedup induced by parallelism to allow aggressive frequency and voltage scaling. These techniques employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available. Therefore, it is likely to parallelize a task (or tasks) even if doing so offers no speedup to the application, thereby undermining the effectiveness of parallelism. As a solution to this problem, we present energy aware task parallelism. Our solution relies on resource allocation graphs and an autonomous parallelism, voltage, and frequency selection algorithm. Using the resource allocation graph as a guide, the autonomous parallelism, voltage, and frequency selection algorithm parallelizes a task only if its parallel version reduces overall application execution time. Simulation results, using representative applications (MPEG4, WLAN), show that our solution promises better resource utilization, compared to the greedy algorithm. Synthesis results (using WLAN) confirm a significant reduction in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%), compared to state of the art.
8.	Jafri, Syed Mohammad Asad Hassan, et al. (author) Polymorphic Configuration Architecture for CGRAs 2016 In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems. - : IEEE. - 1063-8210 .- 1557-9999. ; 24:1, s. 403-407 Journal article (peer-reviewed) Abstract: In the era of platforms hosting multiple applications with arbitrary reconfiguration requirements, static configuration architectures are neither optimal nor desirable. The static reconfiguration architectures either incur excessive overheads or cannot support advanced features (like time-sharing and runtime parallelism). As a solution to this problem, we present a polymorphic configuration architecture (PCA) that provides each application with a configuration infrastructure tailored to its needs.
9.	Tajammul, Muhammad Adeel, et al. (author) DyMeP : An Infrastructure to Support Dynamic Memory Binding for Runtime Mapping in CGRAs 2015 In: Proceedings of the IEEE International Conference on VLSI Design. - : IEEE conference proceedings. - 9781479966585 ; , s. 547-552 Conference paper (peer-reviewed) Abstract: Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications. Commonly, CGRAs are composed of a computation layer (that performs computations) and a memory layer (that provides data and configware to the computation layer). Tempted by higher platform utilization and reliability, recently proposed CGRAs offer dynamic application remapping (for the computation layer). Distributed scratchpad (compiler programmed) memories offer high data rates, predictability and low power consumption (compared to caches). Therefore, the distributed scratchpad memories are emerging as the preferred implementation alternative for the memory layer in recent CGRAs. However, the scratchpad memories are programmed at compile time, and do not support dynamic application remapping. The existing solutions that allow dynamic application remapping either rely on fat binaries (that significantly enhance configuration memory requirements) or consider a centralized memory. To extract the benefits of both runtime remapping and distributed scratchpad memories, we present a design framework called DyMeP. DyMeP relies on late binding and provides the architectural support to dynamically remap data in CGRAs. Compared to the state of the art, the proposed technique reduces the configuration memory requirements (needed by fat binary solutions) and supports distributed shared scratchpad memory. Synthesis/Simulation results reveal that DyMeP promises a significant (up to 60%) reduction in configware size at the cost of negligible additional overheads (less than 1%).
10.	Tajammul, Muhammad Adeel, et al. (author) Private configuration environments (PCE) for efficient reconfiguration, in CGRAs 2013 In: Proceedings Of The 2013 IEEE 24th International Conference On Application-Specific Systems, Architectures And Processors (ASAP 13). - : IEEE Computer Society. - 9781479904921 ; , s. 227-236 Conference paper (peer-reviewed) Abstract: In this paper, we propose a polymorphic configuration architecture that can be tailored to efficiently support the reconfiguration needs of applications at runtime. Today, CGRAs host multiple applications, running simultaneously on a single platform. Novel CGRAs allow each application to exploit late binding and time sharing for enhancing the power and area efficiency. These features require frequent reconfigurations, making reconfiguration time a bottleneck for time critical applications. Existing solutions to this problem either employ powerful configuration architectures or hide configuration latency (using configuration caching). However, both these methods incur significant costs when designed for worst-case reconfiguration needs. As an alternative to a worst-case dedicated configuration mechanism, we exploit reconfiguration to provide each application its private configuration environment (PCE). PCE relies on a morphable configuration infrastructure, a distributed memory sub-system, and a set of PCE controllers. The PCE controllers customize the morphable configuration infrastructure and reserve a portion of the distributed memory sub-system to act as a context memory for each application, separately. Thereby, each application enjoys its own configuration environment which is optimal in terms of configuration speed, memory requirements and energy. Simulation results using representative applications (WLAN and Matrix Multiplication) showed that PCE offers up to 58 % reduction in memory requirements, compared to a dedicated, worst-case configuration architecture. Synthesis results show that the morphable reconfiguration architecture incurs negligible overheads (3 % area and 4 % power compared to a single processing element).


Refine your search

Type of publication: conference paper (8); journal article (3)

Type of content: peer-reviewed (11)

Author/Editor: Tajammul, Muhammad A ... (9); Hemani, Ahmed (8); Hemani, Ahmed, 1961- (3); Tenhunen, Hannu (3); Jafri, Syed Mohammad ... (3); Plosila, Juha (3); Shami, Muhammad Ali (3); Paul, Kolin (3); Farahini, Nasim (2); Tajammul, Muhammad A ... (2); Ellervee, Peeter (1); Li, Shuo (1); Plosila, J. (1); Ellervee, P. (1); Chen, Guo (1); Ye, Wei (1); Sohofi, Hassan (1); Jafri, Syed M. A. H. (1); Jafri, Syed (1); Tenuhnen, Hannu (1); Shami, Muhammad Ali, ... (1); Moorthi, Sridharan M ... (1); Jafri, Syed M. A. (1); Ellerve, P. (1); Shami, Muhammad Ali, ... (1); Moorthi, Sridharan (1)

University: Royal Institute of Technology (11)

Language: English (11)

Research subject (UKÄ/SCB): Engineering and Technology (7); Natural sciences (3)
