SwePub
Search the SwePub database


Hit list for the search "WFRF:(Brorsson Mats 1962 )"

Search: WFRF:(Brorsson Mats 1962 )

  • Results 1-31 of 31
1.
  •  
2.
  • Bao, Yan, et al. (author)
  • An Implementation of Cache-Coherence for the Nios II ™ Soft-core processor
  • 2009
  • Conference paper (peer-reviewed), abstract:
    • Soft-core programmable processors mapped onto field-programmable gate arrays (FPGAs) can be considered equivalents to a microcontroller. They combine central processing units (CPUs), caches, memories, and peripherals on a single chip. Soft-core processors represent an increasingly common embedded software implementation option. Modern FPGA soft-cores are parameterized to support application-specific customization. However, these soft-core processors are designed to be used in uniprocessor systems, not in multiprocessor systems. This project describes an implementation that solves the cache coherency problem in an ALTERA Nios II soft-core multiprocessor system.
  •  
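The coherence mechanism itself is not detailed in the abstract above. As a rough illustration of the kind of write-invalidate protocol such a soft-core multiprocessor needs, here is a minimal sketch of a per-cache-line MSI state machine in C. The states, events and transitions are generic textbook MSI, assumed for illustration, not the actual Nios II implementation described in the paper.

```c
#include <stdio.h>

/* Generic MSI write-invalidate coherence, per cache line (illustrative only). */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } msi_event_t;

/* Returns the next state of a cache line after observing an event. */
static msi_state_t msi_next(msi_state_t s, msi_event_t e)
{
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;    /* fetch line, others may share   */
        if (e == LOCAL_WRITE) return MODIFIED;  /* fetch line, invalidate others  */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE) return MODIFIED;  /* upgrade, broadcast invalidate  */
        if (e == BUS_WRITE)   return INVALID;   /* another core wrote: invalidate */
        return SHARED;
    case MODIFIED:
        if (e == BUS_READ)    return SHARED;    /* supply data, downgrade         */
        if (e == BUS_WRITE)   return INVALID;   /* supply data, then invalidate   */
        return MODIFIED;
    }
    return INVALID;
}

int main(void)
{
    msi_state_t s = INVALID;
    s = msi_next(s, LOCAL_READ);   /* INVALID -> SHARED   */
    s = msi_next(s, LOCAL_WRITE);  /* SHARED  -> MODIFIED */
    printf("final state: %d\n", s);
    return 0;
}
```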
3.
  • Bhatti, Muhammad Khurram, et al. (author)
  • Locality-aware task scheduling for homogeneous parallel computing systems
  • 2018
  • In: Computing. - : Springer Science and Business Media LLC. - 0010-485X .- 1436-5057. ; 100:6, pp. 557-595
  • Journal article (peer-reviewed), abstract:
    • In systems with complex many-core cache hierarchy, exploiting data locality can significantly reduce execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, this would all be useless if the software scheduling does not cast the execution in a manner that promotes locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for the performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinctive phases, namely; working task group formation phase (WTG-FP) and working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely: (1) communication to computation ratio (CCR) between 0.1 and 1.0, (2) application size, i.e., task graphs comprising of 50-, 100-, and 300-tasks per graph, and (3) number of cores with 2-, 4-, 8-, and 16-cores execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
  •  
4.
  • Bhatti, Muhammad Khurram, et al. (author)
  • Noodle : A heuristic algorithm for task scheduling in MPSoC architectures
  • 2014
  • In: Proceedings - 2014 17th Euromicro Conference on Digital System Design, DSD 2014. - : Institute of Electrical and Electronics Engineers Inc.. - 9781479957934 ; , pp. 667-670
  • Conference paper (peer-reviewed), abstract:
    • Task scheduling is crucial for the performance of parallel applications. Given dependence constraints between tasks, their arbitrary sizes, and bounded resources available for execution, optimal task scheduling is considered an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel heuristic algorithm, called the Noodle heuristic, which differs from existing list scheduling techniques in the way it assigns task priorities. We conduct an extensive experimental evaluation to validate Noodle on task graphs taken from the Standard Task Graph (STG) set. Results show that Noodle produces schedules that are within a maximum of 12% (in the worst case) of the optimal schedule for 2-, 4-, and 8-core systems. We also compare Noodle with existing scheduling heuristics and perform a comparative analysis of its performance.
  •  
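The Noodle priority function itself is not given in the abstract above, but the list-scheduling framework it plugs into is standard: a task becomes ready when all its predecessors have finished, and the ready task with the highest priority is assigned to the earliest-available core. Here is a minimal sketch in C; the toy task graph, costs and the placeholder static priorities are invented for illustration and are not Noodle's actual heuristic.

```c
#include <stdio.h>

#define NTASKS 5
#define NCORES 2

/* Toy task graph: per-task cost, predecessor counts, successor lists (-1 terminated). */
static int cost[NTASKS]         = { 2, 3, 1, 4, 2 };
static int npred[NTASKS]        = { 0, 1, 1, 1, 2 };
static int succ[NTASKS][NTASKS] = { {1, 2, -1}, {3, -1}, {4, -1}, {4, -1}, {-1} };
static int prio[NTASKS]         = { 5, 4, 3, 2, 1 };  /* placeholder priorities */

int main(void)
{
    int core_free[NCORES]   = { 0, 0 };  /* time when each core becomes free   */
    int ready_time[NTASKS]  = { 0 };     /* earliest start given predecessors  */
    int done = 0;

    while (done < NTASKS) {
        /* Pick the ready task (npred == 0, not yet scheduled) with highest priority. */
        int best = -1;
        for (int t = 0; t < NTASKS; t++)
            if (npred[t] == 0 && (best < 0 || prio[t] > prio[best]))
                best = t;

        /* Assign it to the core that becomes free earliest. */
        int core = 0;
        for (int c = 1; c < NCORES; c++)
            if (core_free[c] < core_free[core])
                core = c;

        int start  = core_free[core] > ready_time[best] ? core_free[core] : ready_time[best];
        int finish = start + cost[best];
        core_free[core] = finish;
        printf("task %d on core %d, start %d, finish %d\n", best, core, start, finish);

        npred[best] = -1;                 /* mark as scheduled            */
        for (int *s = succ[best]; *s != -1; s++) {
            npred[*s]--;                  /* release successors           */
            if (finish > ready_time[*s])
                ready_time[*s] = finish;  /* respect dependence in time   */
        }
        done++;
    }
    return 0;
}
```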
5.
  • Brorsson, Mats, 1962-, et al. (author)
  • Adaptive and flexible dictionary code compression for embedded applications
  • 2006
  • In: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems. - New York, NY, USA : ACM. - 1595935436 ; , pp. 113-124
  • Conference paper (peer-reviewed), abstract:
    • Dictionary code compression is a technique where long instructions in memory are replaced with shorter code words used as indices into a table from which the original instructions are looked up. We present a new view of dictionary code compression for moderately high-performance processors for embedded applications. Previous work on dictionary code compression has shown decent performance and energy-savings results, which we verify with our own measurements that are more thorough than previously published. We also augment previous work with a more thorough analysis of the effects of cache and line size changes. In addition, we introduce the concept of aggregated profiling to allow two or more programs to share the same dictionary contents. Finally, we also introduce dynamic dictionaries, where the dictionary contents are considered part of the context of a process, and show that the performance overhead of reloading the dictionary contents on a context switch is negligible while at the same time considerable energy can be saved with more specialized dictionary contents.
  •  
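As a rough illustration of the lookup mechanism described above (short code words used as indices into a dictionary of full-length instructions), here is a minimal decode-side sketch in C. The 16-bit code-word format, the escape marker and the dictionary size are invented for the example; the paper's actual encoding and its hardware decompressor differ.

```c
#include <stdint.h>
#include <stdio.h>

#define DICT_SIZE 256
#define ESCAPE    0xFFFFu  /* code word meaning: next two 16-bit words hold a raw instruction */

/* Dictionary of frequently executed 32-bit instructions, filled by profiling. */
static uint32_t dict[DICT_SIZE];

/* Decode one entry of the compressed instruction stream and advance *pc. */
static uint32_t fetch_decode(const uint16_t *code, size_t *pc)
{
    uint16_t w = code[(*pc)++];
    if (w == ESCAPE) {                    /* uncompressed instruction follows */
        uint32_t hi = code[(*pc)++];
        uint32_t lo = code[(*pc)++];
        return (hi << 16) | lo;
    }
    return dict[w % DICT_SIZE];           /* short code word: dictionary lookup */
}

int main(void)
{
    dict[3] = 0xE2800001;                 /* e.g. a frequently used ADD instruction */
    uint16_t code[] = { 3, ESCAPE, 0xE59F, 0x0010, 3 };
    size_t pc = 0, n = sizeof code / sizeof code[0];
    while (pc < n)
        printf("0x%08X\n", fetch_decode(code, &pc));
    return 0;
}
```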
6.
  • Brorsson, Mats, 1962- (author)
  • MipsIt - a simulation and development environment using animation for computer architecture education
  • 2002
  • Conference paper (other academic/artistic), abstract:
    • Computer animation is a tool which nowadays is used in more and more fields. In this paper we describe the use of computer animation to support the learning of computer organization itself. MipsIt is a system consisting of a software development environment, a system and cache simulator and a highly flexible microarchitecture simulator used for pipeline studies. It has been in use for several years now and constitutes an important tool in the education at Lund University and KTH, Royal Institute of Technology in Sweden.
  •  
7.
  • Collin, Mikael, et al. (author)
  • A performance and energy exploration of dictionary code compression architectures
  • 2011
  • In: 2011 International Green Computing Conference and Workshops (IGCC). - : IEEE conference proceedings. - 9781457712227 ; , pp. 1-8
  • Conference paper (peer-reviewed), abstract:
    • We have made a performance and energy exploration of a previously proposed dictionary code compression mechanism where frequently executed individual instructions and/or sequences are replaced in memory with short code words. Our simulated design shows a dramatically reduced instruction memory access frequency leading to a performance improvement for small instruction cache sizes and to significantly reduced energy consumption in the instruction fetch path. We have evaluated the performance and energy implications of three architectural parameters: branch prediction accuracy, instruction cache size and organization. To assess the complexity of the design we have implemented the critical stages in VHDL.
  •  
8.
  • Collin, Mikael, et al. (author)
  • Low Power Instruction Fetch using Profiled Variable Length Instructions
  • 2003
  • Conference paper (peer-reviewed), abstract:
    • Computer system performance depends on a high access rate and a low miss rate in the instruction cache, which also affect the energy consumed by fetching instructions. Simulation of a small computer typical of embedded systems shows that up to 20% of the overall processor energy is consumed in the instruction fetch path and as much as 23% of the execution time is spent on instruction fetch. One way to increase the instruction memory bandwidth is to fetch more instructions on each access without increasing the bus width. We propose an extension to a RISC ISA with variable length instructions, yielding higher information density without compromising programmability. Based on profiling of dynamic instruction usage and argument locality of a set of SPEC CPU2000 applications, we present a scheme using 8-, 16-, and 24-bit instructions accompanied by lookup tables inside the processor. Our scheme yields a 20-30% reduction in static memory usage, and experiments show that up to 60% of all executed instructions are short instructions. The overall energy savings are up to 15% for the entire data path and memory system, and up to 20% in the instruction fetch path.
  •  
9.
  •  
10.
  • Du, M., et al. (author)
  • Improving real-time bidding using a constrained Markov decision process
  • 2017
  • In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017. - Cham : Springer. - 9783319691787 ; , pp. 711-726
  • Conference paper (peer-reviewed), abstract:
    • Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which the ad slots are sold through real-time auctions upon users visiting websites or using mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process as a Constrained Markov Decision Process based reinforcement learning framework. Our model uses the predicted click-through-rate as the state, bid price as the action, and ad clicks as the reward. We propose a bidding function, which outperforms the state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performances of the bidding strategies when required to adapt to a dynamic environment.
  •  
11.
  • Du, M., et al. (author)
  • Time series modeling of market price in real-time bidding
  • 2019
  • In: ESANN 2019 - Proceedings, 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. - : ESANN. ; , pp. 643-648
  • Conference paper (peer-reviewed), abstract:
    • Real-Time-Bidding (RTB) is one of the most popular online advertisement selling mechanisms. Modeling the highly dynamic bidding environment is crucial for making good bids. Market prices of auctions fluctuate heavily within short time spans. State-of-the-art methods neglect the temporal dependencies of bidders’ behaviors. In this paper, the bid requests are aggregated by time and the mean market price per aggregated segment is modeled as a time series. We show that the Long Short Term Memory (LSTM) neural network outperforms the state-of-the-art univariate time series models by capturing the nonlinear temporal dependencies in the market price. We further improve the predicting performance by adding a summary of exogenous features from bid requests.
  •  
12.
  • Fang, Huan, et al. (author)
  • Scalable directory architecture for distributed shared memory chip multiprocessors
  • 2008
  • In: Proceedings of the 1st Swedish Workshop on Multi-core Computing. ; , pp. 73-81
  • Conference paper (peer-reviewed), abstract:
    • The traditional directory-based cache coherence protocol is far from optimal for large-scale cache-coherent shared memory multiprocessors due to the increasing latency of accessing directories stored in DRAM memory. Instead of keeping directories in main memory, we consider distributing the directory together with the L2 cache across all nodes on a chip multiprocessor. Each node contains a processing unit, a private L1 cache, a slice of L2 cache, a memory controller and a router. Both the L2 cache and the memories are distributed, shared and interleaved by a subset of memory address bits. All nodes are interconnected through a low-latency two-dimensional mesh network. The directory, as a component split from the L2 cache, only stores sharing information for blocks, while the L2 cache only stores data blocks exclusive with the L1 cache. A shared L2 cache can increase the total effective cache capacity on chip, but also increases the miss latency when data is on a remote node. Unlike a directory cache structure, our proposal removes the directory from memory entirely, which saves memory space and reduces access latency. Compared to an L2 cache that combines directory information internally, our split L2 cache structure saves over 88% of cache space while achieving similar performance.
  •  
13.
  • Fang, Huan, et al. (author)
  • Scalable directory architecture for distributed shared memory chip multiprocessors
  • 2008
  • In: SIGARCH Computer Architecture News. - : ACM Press. - 0163-5964 .- 1943-5851. ; 36:5, pp. 56-64
  • Journal article (peer-reviewed), abstract:
    • Traditional Directory-based cache coherence protocol is far from optimal for large-scale cache coherent shared memory multiprocessors due to the increasing latency to access directories stored in DRAM memory. Instead of keeping directories in main memory, we consider distributing the directory together with L2 cache across all nodes on a Chip Multiprocessor. Each node contains a processing unit, a private L1 cache, a slice of L2 cache, memory controller and a router. Both L2 cache and memories are distributed shared and interleaved by a subset of memory address bits. All nodes are interconnected through a low latency two dimensional Mesh network. Directory, being a split component to L2 cache, only stores sharing information for blocks while L2 cache stores only data blocks exclusive with L1 cache. Shared L2 cache can increase total effective cache capacity on chip, but also increase the miss latency when data is on a remote node. Being different from Directory Cache structure, our proposal totally removes the directory from memory, which saves memory space and reduces access latency. Compared to L2 cache that combines directory information internally, our L2 cache structure saves up to 88% cache space and achieves similar performance.
  •  
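Both records above describe L2 slices and directory information being interleaved across nodes by a subset of the physical address bits. A minimal sketch in C of that kind of home-node mapping; the block size, node count and bit positions are assumed values for illustration, not those used in the paper.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6     /* 64-byte cache blocks (assumed)       */
#define NODE_BITS         4     /* 16 nodes => 4 address bits (assumed) */
#define NUM_NODES         (1u << NODE_BITS)

/* The home node (holding the L2 slice and directory entry for a block)
 * is selected by the address bits just above the block offset. */
static unsigned home_node(uint64_t paddr)
{
    return (unsigned)((paddr >> BLOCK_OFFSET_BITS) & (NUM_NODES - 1));
}

int main(void)
{
    uint64_t addrs[] = { 0x0000, 0x0040, 0x0080, 0x12340 };
    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        printf("address 0x%llx -> home node %u\n",
               (unsigned long long)addrs[i], home_node(addrs[i]));
    return 0;
}
```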
14.
  • Faxén, Karl-Filip, et al. (author)
  • Multicore computing--the state of the art
  • 2009
  • Report (other academic/artistic), abstract:
    • This document presents the current state of the art in multicore computing, in hardware and software, as well as ongoing activities, especially in Sweden. To a large extent, it draws on the presentations given at the Multicore Days 2008 organized by SICS, Swedish Multicore Initiative and Ericsson Software Research but the published literature and the experience of the authors has been equally important sources. It is clear that multicore processors will be with us for the foreseeable future; there seems to be no alternative way to provide substantial increases of microprocessor performance in the coming years. While processors with a few (2–8) cores are common today, this number is projected to grow as we enter the era of manycore computing. The road ahead for multicore and manycore hardware seems relatively clear, although some issues like the organization of the on-chip memory hierarchy remain to be settled. Multicore software is however much less mature, with fundamental questions of programming models, languages, tools and methodologies still outstanding.
  •  
15.
  • Issa, Shady, 1989- (author)
  • Techniques for Enhancing the Efficiency of Transactional Memory Systems
  • 2018
  • Doctoral thesis (other academic/artistic), abstract:
    • Transactional Memory (TM) is an emerging programming paradigm that drastically simplifies the development of concurrent applications by relieving programmers from a major source of complexity: how to ensure correct, yet efficient, synchronization of concurrent accesses to shared memory. Despite the large body of research devoted to this area, existing TM systems still suffer from severe limitations that hamper both their performance and energy efficiency. This dissertation tackles the problem of how to build efficient implementations of the TM abstraction by introducing innovative techniques that address three crucial limitations of existing TM systems by: (i) extending the effective capacity of Hardware TM (HTM) implementations; (ii) reducing the synchronization overheads in Hybrid TM (HyTM) systems; (iii) enhancing the efficiency of TM applications via energy-aware contention management schemes. The first contribution of this dissertation, named POWER8-TM (P8TM), addresses what is arguably one of the most compelling limitations of existing HTM implementations: the inability to process transactions whose footprint exceeds the capacity of the processor's cache. By leveraging, in an innovative way, two hardware features provided by IBM POWER8 processors, namely Rollback-only Transactions and Suspend/Resume, P8TM can achieve up to 7x performance gains in workloads that stress the capacity limitations of HTM. The second contribution is Dynamic Memory Partitioning-TM (DMP-TM), a novel Hybrid TM (HyTM) that offloads the cost of detecting conflicts between HTM and Software TM (STM) to off-the-shelf operating system memory protection mechanisms. DMP-TM's design is agnostic to the STM algorithm and has the key advantage of allowing for integrating, in an efficient way, highly scalable STM implementations that would, otherwise, demand expensive instrumentation of the HTM path. This allows DMP-TM to achieve up to 20x speedups compared to state of the art HyTM solutions in uncontended workloads. The third contribution, Green-CM, is an energy-aware Contention Manager (CM) that has two main innovative aspects: (i) a novel asymmetric design, which combines different back-off policies in order to take advantage of Dynamic Frequency and Voltage Scaling (DVFS) hardware capabilities, available in most modern processors; (ii) an energy efficient implementation of a fundamental building block for many CM implementations, namely, the mechanism used to back-off threads for a predefined amount of time. Thanks to its innovative design, Green-CM can reduce the Energy Delay Product by up to 2.35x with respect to state of the art CMs. All the techniques proposed in this dissertation share an important common feature that is essential to preserve the ease of use of the TM abstraction: the reliance on on-line self-tuning mechanisms that ensure robust performance even in presence of heterogeneous workloads, without requiring prior knowledge of the target workloads or architecture.
  •  
16.
  • Javed Awan, Ahsan, 1988-, et al. (author)
  • Identifying the potential of Near Data Processing for Apache Spark
  • 2017
  • In: Proceedings of the International Symposium on Memory Systems, MEMSYS 2017. - New York, NY, USA : Association for Computing Machinery (ACM). ; , pp. 60-67
  • Conference paper (peer-reviewed), abstract:
    • While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics for being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known if NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, by extensive profiling of Apache Spark based workloads on an Ivy Bridge server.
  •  
17.
  • Karlsson, Sven, et al. (author)
  • A comparative characterization of communication patterns in applications using MPI and shared memory on an IBM SP2
  • 1998
  • Conference paper (peer-reviewed), abstract:
    • In this paper we analyze the characteristics of communication in three different applications, FFT, Barnes and Water, on an IBM SP2. We contrast the communication using two different programming models: message-passing, MPI, and shared memory, represented by a state-of-the-art distributed virtual shared memory package, TreadMarks. We show that while communication time and busy times are comparable for small systems, the communication patterns are fundamentally different, leading to poor performance for TreadMarks-based applications when the number of processors increases. This is due to the request/reply technique used in TreadMarks that results in a large fraction of very small messages. However, if the application can be tuned to reduce the impact of small-message communication it is possible to achieve acceptable performance at least up to 32 nodes. Our measurements also show that TreadMarks programs tend to cause a more even network load compared to MPI programs.
  •  
18.
  • Karlsson, Sven, et al. (author)
  • A free OpenMP compiler and run-time library infrastructure for research on shared memory parallel computing
  • 2004
  • In: Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems. - : ACTA Press. ; , pp. 354-361
  • Conference paper (peer-reviewed), abstract:
    • OpenMP is an informal industry standard for programming parallel computers with a shared memory and has during the last few years achieved considerable acceptance in both the academic world and industry. OpenMP is a thread-level fork-join programming model and relies on a set of compiler directives. An OpenMP-aware compiler uses these directives to generate a multi-threaded application. In practice, an OpenMP run-time library is also needed as OpenMP specifies a set of run-time library calls. In this paper we report on a free OpenMP compiler and run-time library infrastructure. We present an OpenMP compiler for C called OdinMP and briefly discuss the run-time library that the compiler targets. The source code to both the compiler and the run-time libraries is available and can be freely used for OpenMP research. The compilation system is evaluated using the EPCC micro-benchmark suite for OpenMP and a set of applications from the SPLASH-2 benchmark suite ported to OpenMP. Comparisons are made to OpenMP-aware compiler systems from SGI and Intel. The performance of code generated with the presented compilation system is shown to be very close to or exceeding that of commercial compilers for a wide range of benchmark applications.
  •  
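To make the fork-join model concrete: an OpenMP compiler such as the OdinMP system described above takes directive-annotated C like the snippet below and rewrites it into multi-threaded code plus calls into a run-time library. The snippet is a generic OpenMP example, not code taken from the paper.

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* The compiler outlines this region into a function executed by a team
     * of threads (fork); the threads join again at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1);

    printf("harmonic(%d) = %f using at most %d threads\n",
           n, sum, omp_get_max_threads());
    return 0;
}
```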
19.
  • Karlsson, Sven, et al. (author)
  • A Fully Compliant OpenMP implementation on Software Distributed Shared Memory
  • 2002
  • Conference paper (peer-reviewed), abstract:
    • OpenMP is a relatively new industry standard for programming parallel computers with a shared memory programming model. Given that clusters of workstations are a cost-effective solution to build parallel platforms, it would of course be highly interesting if the OpenMP model could be extended to these systems as well as to the standard shared memory architectures for which it was originally intended. We present in this paper a fully compliant implementation of the OpenMP specification 1.0 for C targeting networks of workstations. We have used an experimental software distributed shared memory system, CVM, to implement a run-time library which is the target of a source-to-source OpenMP translator also developed in this project. The system has been evaluated using an OpenMP microbenchmark suite used to evaluate the effect of some memory coherence protocol improvements. We have also used OpenMP versions of three Splash-2 applications concluding in reasonable speedups on an IBM SP machine with eight nodes. This is the first study to investigate the subtle mechanisms of consistency in OpenMP on software DSM systems.
  •  
20.
  •  
21.
  •  
22.
  •  
23.
  • Karlsson, S., et al. (author)
  • Producer-push - a protocol enhancement to page-based software distributed shared memory systems
  • 1999
  • In: Proceedings of ICPP’99. ; , pp. 291-300
  • Conference paper (peer-reviewed), abstract:
    • This paper describes a technique called producer-push that enhances the performance of a page-based software distributed shared memory system. Shared data, in software DSM systems, must normally be requested from the node that produced the latest value. Producer-push utilizes the execution history to predict this communication so that the data is pushed to the consumer before it is requested. In contrast to previously proposed mechanisms to proactively send data to where it is needed, producer-push uses information about the source code location of communication to more accurately predict the needed communication. Producer-push requires no source code modifications of the application and it effectively reduces the latency of shared memory accesses. This is confirmed by our performance evaluation which shows that the average time to wait for memory updates is reduced by 74%. Producer-push also changes the communication pattern of an application making it more suitable for modern networks. The latter is a result of a 44% reduction of the average number of messages and an enlargement of the average message size by 65%.
  •  
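A heavily simplified sketch of the producer-push idea described above: the producer remembers, per source-code location of a synchronization point, which nodes requested its pages the last time that point was reached, and eagerly pushes updates to those nodes on the next visit. The data structures and the `current_source_location`/`push_page` helpers are hypothetical stand-ins for the DSM run-time, invented for illustration only.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SITES 64    /* distinct source-code locations of sync points */
#define MAX_NODES 32

/* For each sync site, remember which nodes fetched pages we produced after
 * the previous visit to that site: they are the predicted consumers. */
static unsigned char predicted_consumer[MAX_SITES][MAX_NODES];

/* Stubs standing in for the DSM run-time system (hypothetical helpers). */
static int current_source_location(void) { return 0; }
static void push_page(int page, int node) { printf("push page %d to node %d\n", page, node); }

/* Called when a remote node requests a page we produced: learn the consumer. */
static void on_remote_request(int site, int node)
{
    predicted_consumer[site][node] = 1;
}

/* Called at a release point: push dirty pages to predicted consumers instead
 * of waiting for their request/reply round trips. */
static void producer_push(const int *dirty_pages, int ndirty)
{
    int site = current_source_location();
    for (int node = 0; node < MAX_NODES; node++) {
        if (!predicted_consumer[site][node])
            continue;
        for (int i = 0; i < ndirty; i++)
            push_page(dirty_pages[i], node);
    }
    /* Forget the old prediction; it is rebuilt from the next round of misses. */
    memset(predicted_consumer[site], 0, MAX_NODES);
}

int main(void)
{
    on_remote_request(0, 2);            /* node 2 fetched our data last time */
    int dirty[] = { 10, 11 };
    producer_push(dirty, 2);            /* now push the same pages eagerly   */
    return 0;
}
```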
24.
  • Muddukrishna, Ananya, et al. (author)
  • Locality-aware task scheduling and data distribution on NUMA systems
  • 2013
  • In: Lecture Notes in Computer Science. - Berlin, Heidelberg : Springer Berlin Heidelberg. - 9783642406973 ; , pp. 156-170
  • Conference paper (peer-reviewed), abstract:
    • Modern parallel computer systems exhibit Non-Uniform Memory Access (NUMA) behavior. For best performance, any parallel program therefore has to match data allocation and scheduling of computations to the memory architecture of the machine. When done manually, this becomes a tedious process and since each individual system has its own peculiarities this also leads to programs that are not performance-portable. We propose the use of a data distribution scheme in which NUMA hardware peculiarities are abstracted away from the programmer and data distribution is delegated to a runtime system which is generated once for each machine. In addition we propose using task data dependence information now possible with the OpenMP 4.0RC2 proposal to guide the scheduling of OpenMP tasks to further reduce data stall times. We demonstrate the viability and performance of our proposals on a four socket AMD Opteron machine with eight NUMA nodes. We identify that both data distribution and locality-aware task scheduling improves performance compared to default policies while still providing an architecture-oblivious approach for the programmer.
  •  
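The record above relies on task data-dependence information of the kind introduced with OpenMP 4.0. Below is a minimal, generic C example of the `depend` clauses that give a locality-aware runtime this information; the blocked vector initialization and scaling are an invented toy workload, not code from the paper.

```c
#include <omp.h>
#include <stdio.h>

#define N     4096
#define BLOCK 1024

int main(void)
{
    static double a[N];   /* static storage => shared among threads */

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < N; i += BLOCK) {
            /* Producer task: writes one block of a[]. */
            #pragma omp task depend(out: a[i]) firstprivate(i)
            for (int j = i; j < i + BLOCK; j++)
                a[j] = j;

            /* Consumer task: depend(in:) names the data it reads, so a
             * locality-aware runtime can place it near the producer and
             * near the NUMA node holding this block. */
            #pragma omp task depend(in: a[i]) firstprivate(i)
            for (int j = i; j < i + BLOCK; j++)
                a[j] *= 2.0;
        }
        #pragma omp taskwait   /* wait for all tasks before leaving the region */
    }

    printf("a[0]=%.1f a[%d]=%.1f\n", a[0], N - 1, a[N - 1]);
    return 0;
}
```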
25.
  • Nikitovic, Mladen, et al. (author)
  • A multiprogrammed workload model for energy and performance estimation of adaptive chip-multiprocessors
  • 2004
  • In: Proceedings of the 18th International Parallel and Distributed Processing Symposium, 2004. - : IEEE. - 0769521320 ; , pp. 3449-3456
  • Conference paper (peer-reviewed), abstract:
    • Summary form only given. Today, there is a trend towards steadily increasing functionality in mobile terminals. This trend in turn increases the performance demand on the architecture that is supposed to do all the work. It is likely that more traditional architectures like multiprocessors are used in future mobile terminals. They are attractive because they can now be integrated on a single chip and can provide the desired performance efficiently if intelligently managed. Choosing the most efficient architecture configuration is however a complex issue and depends on multiple factors. We believe that the way the behavior of the workload is modeled is of paramount importance when estimating the efficiency of any proposed architecture for future mobile terminals. Therefore, a deterministic and simple workload description is needed. In this paper, we show how such a multiprogrammed workload is created and used for energy and performance estimation of an adaptive chip-multiprocessor (CMP) architecture.
  •  
26.
  • Nikitovic, Mladen, et al. (author)
  • A study on periodic shutdown for adaptive CMPs in handheld devices
  • 2008
  • In: 2008 13th Asia-Pacific Computer Systems Architecture Conference. - New York : IEEE. - 9781424426829 ; , pp. 308-314
  • Conference paper (peer-reviewed), abstract:
    • The challenge to satisfy the demand for higher computing performance has become an increasingly difficult task to achieve. In the area of mobile devices, this demand has to be carefully balanced with an efficient use of the power source. We propose the use of an adaptive architecture that enables savings in power and energy in an intuitive way, considering the properties of future process technologies. We satisfy performance demand by utilizing thread-level parallelism and minimize the power and energy consumption by proposing an adaptive strategy that manages the power state of each individual CMP-core. In this study, we propose a periodical shutdown strategy and evaluate it in a multiprogrammed workload environment. Results show that a large amount of idle time, 77 %, can be saved by putting processors into power-saving states. Furthermore, introducing timeouts can dramatically decrease the number of state transitions.
  •  
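At its core, the periodic shutdown strategy with timeouts described above is a policy on idle cores: only transition a core to a power-saving state after it has been idle longer than a threshold, so that short idle gaps do not cause state transitions. A minimal sketch in C; the timeout value, tick length and data structures are assumptions for illustration, not the paper's actual strategy.

```c
#include <stdio.h>

#define NCORES     4
#define TIMEOUT_MS 50   /* assumed idle threshold before shutting a core down */

typedef enum { ACTIVE, POWER_SAVING } core_state_t;

typedef struct {
    core_state_t state;
    int idle_ms;        /* how long the core has had no runnable work */
} core_t;

/* Called periodically by the scheduler: a core idle longer than the timeout
 * is put into a power-saving state; a core that gets work again is woken up.
 * The timeout avoids a transition on every short idle gap. */
static void update_power_states(core_t cores[], const int has_work[])
{
    for (int c = 0; c < NCORES; c++) {
        if (has_work[c]) {
            cores[c].state = ACTIVE;
            cores[c].idle_ms = 0;
        } else if (++cores[c].idle_ms >= TIMEOUT_MS && cores[c].state == ACTIVE) {
            cores[c].state = POWER_SAVING;
            printf("core %d -> power-saving after %d ms idle\n", c, cores[c].idle_ms);
        }
    }
}

int main(void)
{
    core_t cores[NCORES] = { {ACTIVE, 0}, {ACTIVE, 0}, {ACTIVE, 0}, {ACTIVE, 0} };
    int work[NCORES] = { 1, 0, 0, 1 };

    for (int tick = 0; tick < 100; tick++)   /* 1 ms per tick (assumed) */
        update_power_states(cores, work);
    return 0;
}
```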
27.
  •  
28.
  • Oz, Isil, et al. (author)
  • Regression-Based Prediction for Task-Based Program Performance
  • 2019
  • In: Journal of Circuits, Systems and Computers. - : World Scientific Publ Co Pte Ltd. - 0218-1266. ; 8:4
  • Journal article (peer-reviewed), abstract:
    • As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been released to exploit parallelism in the applications. Task-based programming model uses task abstractions to specify parallel tasks and schedules tasks onto processors at runtime. In order to increase the efficiency and get the highest performance, it is required to identify which runtime configuration is needed and how processor cores must be shared among tasks. Exploring design space for all possible scheduling and runtime options, especially for large input data, becomes infeasible and requires statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable, and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict the task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs by varying the runtime parameters, and conduct a systematic measurement for influencing factors on execution time. Our approach uses executions with different configurations for a set of input data, and derives different regression models to predict execution time for larger input data. Our results show that regression models provide accurate predictions for validation inputs with mean error rate as low as 6.3%, and 14% on average among four task-based programs.
  •  
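As a minimal illustration of the regression-based modeling idea above, the sketch below fits a least-squares line to (input size, execution time) observations and extrapolates to a larger input. This is ordinary one-variable linear regression, a simplification of the multi-factor models used in the paper, and the sample numbers are made up.

```c
#include <stdio.h>

/* Fit y = a + b*x by ordinary least squares. */
static void fit(const double *x, const double *y, int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* Hypothetical measurements: input size (millions of elements) vs. seconds. */
    double size[] = { 1, 2, 4, 8 };
    double time[] = { 0.9, 1.7, 3.6, 7.1 };
    double a, b;

    fit(size, time, 4, &a, &b);
    printf("model: time = %.3f + %.3f * size\n", a, b);
    printf("predicted time for size 32: %.2f s\n", a + b * 32);
    return 0;
}
```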
29.
  • Podobas, Artur, et al. (author)
  • A Comparison of some recent Task-based Parallel Programming Models
  • 2010. - 8
  • In: Proceedings of the 3rd Workshop on Programmability Issues for Multi-Core Computers (MULTIPROG'2010), Jan 2010, Pisa.
  • Conference paper (peer-reviewed), abstract:
    • The need for parallel programming models that are simple to use and at the same time efficient for current and future parallel platforms has led to recent attention to task-based models such as Cilk++, Intel TBB and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on the final performance, and in order to understand some of the trade-offs we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun Studio and the research compiler Mercurium/nanos mcc), Cilk++ and Wool, a high-performance task-based library developed at SICS. We use microbenchmarks to characterize the costs of task creation and stealing, and the Barcelona OpenMP Tasks Suite for characterizing application performance. By far, Wool and Cilk++ have the lowest overhead in both spawning and stealing tasks. This is reflected in application performance when many tasks with small granularity are spawned, where Cilk++ and, in particular, Wool have the highest performance. For coarse-granularity applications, the OpenMP implementations have quite similar performance to the more light-weight Cilk++ and Wool, except for one application where mcc is superior thanks to a superior task scheduler. The OpenMP implementations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to focus on this issue.
  •  
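The microbenchmark methodology mentioned above boils down to spawning very many fine-grained tasks and timing them. Below is a generic OpenMP-task version of the classic recursive Fibonacci benchmark, commonly used for exactly this purpose; it is not claimed to be the benchmark code used in the paper.

```c
#include <omp.h>
#include <stdio.h>

/* Recursive Fibonacci with one task per call: almost all of the run time is
 * task creation and stealing overhead, which is what the benchmark probes. */
static long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main(void)
{
    long r;
    double t0 = omp_get_wtime();

    #pragma omp parallel
    #pragma omp single
    r = fib(30);

    printf("fib(30) = %ld in %.3f s\n", r, omp_get_wtime() - t0);
    return 0;
}
```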
30.
  • Podobas, Artur, et al. (author)
  • Architecture-aware Task-scheduling : A thermal approach
  • 2011
  • In: http://faspp.ac.upc.edu/faspp11.
  • Conference paper (peer-reviewed), abstract:
    • Current task-centric many-core schedulers share a “naive” view of processor architecture; a view that does not care about its thermal, architectural or power-consumption properties. Future processors will be more heterogeneous than what we see today, and following Moore’s law of transistor doubling, we foresee an increase in power consumption and thus temperature. Thermal stress can induce errors in processors, and so a common way to counter this is by slowing the processor down; something task-centric schedulers should strive to avoid. The Thermal-Task-Interleaving scheduling algorithm proposed in this paper takes both the application temperature behavior and the architecture into account when making decisions. We show that for a mixed workload, our scheduler outperforms some of the standard, architecture-unaware scheduling solutions existing today.
  •  
31.
  • Radu, M., et al. (author)
  • Work in progress - graduate exchange program in microelectronics system engineering
  • 2008
  • In: FIE. - 9781424419692 ; , pp. 563-564
  • Conference paper (peer-reviewed), abstract:
    • In today's world, where new technologies emerge and advance at a very fast pace every year, many professional societies are discussing moving to a Master-level program as a "first professional degree", anticipating graduates with advanced skills for tomorrow's demanding and advanced industry. In this context, education at the master level is becoming more and more important. Another key issue in today's world is the impact of the globalization process (the needs of multinational corporations). Engineering education must address the impact of global hiring. Graduates entering the global workplace must possess, besides the essential technical skills, also cultural, social and communication skills, enabling them to work and interact in international environments, bringing creativity and innovative development to multi-cultural groups. In this context, exchange programs between universities located in different countries and continents are flourishing, with universities trying to integrate study-abroad components into their programs. This paper presents, as a "Work in Progress", the first steps of an exchange program at the graduate level in the area of Microelectronics between two prestigious universities located in the USA (Rose-Hulman Institute of Technology, Terre Haute, IN) and Sweden (Royal Institute of Technology, Stockholm). A Joint Degree or Dual Degree program at the Master level is envisaged in the near future.
  •  