SwePub - sökning: WFRF:(Tsigas Philippas 1967)

Numrering	Referens	Omslagsbild	Hitta
1.	Almgren, Magnus, 1972, et al. (författare) Mapping Systems Security Research at Chalmers 2011 Ingår i: First SysSec Workshop (SysSec 2011). - 9780769545301 ; , s. 67-70 Konferensbidrag (refereegranskat)abstract The department of Computer Science and Engineering at Chalmers University has a long tradition of research in systems security, including security metrics, attack detection, and mitigation. We focus on security issues arising in four specific environments: (1) backbone links, (2) sensor networks, (3) the connected car, and (4) the smart grid. In this short summary we describe recent results as well as open research questions we are exploring.
2.	Assarsson, Ulf, 1972, et al. (författare) Image-Space Dynamic Transparency for Improved Object Discovery in 3D Environments 2006 Rapport (övrigt vetenskapligt/konstnärligt)abstract We present an image-space algorithm for dynamic transparency with the purpose of supporting object discovery and access in information-rich 3D visualization environments. The algorithm is based on multiple rendering passes and detects instances of object occlusion in the image-space using the fragment shader capabilities of modern programmable graphics hardware, creating alpha maps of opacity gradients around the occluded objects. In essence, the effect is somewhat similar to X-ray vision of a superhero. We have implemented a prototype version of our algorithm with real-time rendering performance using a number of optimizations and speedups on current graphics hardware. To evaluate its use, we have also implemented three different example applications portraying different scenarios, including abstract visualization, virtual walkthrough, and gaming. Preliminary results from an empirical user study comparing our technique to standard viewpoint controls indicate that our technique is superior in regards to object discovery efficiency. These results hold over both completion times as well as correctness of a visual search task.
3.	Atalar, Aras, 1985, et al. (författare) Analyzing the Performance of Lock-Free Data Structures: A Conflict-Based Model 2015 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783662486535 ; 9363, s. 341-355 Konferensbidrag (refereegranskat)abstract This paper considers the modeling and the analysis of the performance of lock-free concurrent data structures that can be represented as linear combinations of fixed size retry loops. Our main contribution is a new way of modeling and analyzing a general class of lock-free algorithms, achieving predictions of throughput that are close to what we observe in practice. We emphasize two kinds of conflicts that shape the performance: (i) hardware conflicts, due to concurrent calls to atomic primitives; (ii) logical conflicts, caused by concurrent operations on the shared data structure. We propose also a common framework that enables a fair comparison between lock-free implementations by covering the whole contention domain, and comes with a method for calculating a good back-off strategy. Our experimental results, based on a set of widely used concurrent data structures and on abstract lock-free designs, show that our analysis follows closely the actual code behavior.(1)
4.	Atalar, Aras, 1985, et al. (författare) How lock-free data structures perform in dynamic environments: Models and analyses 2017 Ingår i: Leibniz International Proceedings in Informatics, LIPIcs. - 1868-8969. - 9783959770316 ; 70, s. 23.1-23.17 Konferensbidrag (refereegranskat)abstract © Aras Atalar, Paul Renaud-Goud, and Philippas Tsigas.In this paper we present two analytical frameworks for calculating the performance of lock-free data structures. Lock-free data structures are based on retry loops and are called by application-specific routines. In contrast to previous work, we consider in this paper lock-free data structures in dynamic environments. The size of each of the retry loops, and the size of the application routines invoked in between, are not constant but may change dynamically. The new frameworks follow two different approaches. The first framework, the simplest one, is based on queuing theory. It introduces an average-based approach that facilitates a more coarse-grained analysis, with the benefit of being ignorant of size distributions. Because of this independence from the distribution nature it covers a set of complicated designs. The second approach, instantiated with an exponential distribution for the size of the application routines, uses Markov chains, and is tighter because it constructs stochastically the execution, step by step. Both frameworks provide a performance estimate which is close to what we observe in practice. We have validated our analysis on (i) several fundamental lock-free data structures such as stacks, queues, deques and counters, some of them employing helping mechanisms, and (ii) synthetic tests covering a wide range of possible lock-free designs. We show the applicability of our results by introducing new back-off mechanisms, tested in application contexts, and by designing an efficient memory management scheme that typical lock-free algorithms can utilize.
5.	Atalar, Aras, 1985, et al. (författare) Lock-Free Search Data Structures: Throughput Modeling with Poisson Processes 2019 Ingår i: Leibniz International Proceedings in Informatics, LIPIcs. - 1868-8969. ; 125 Konferensbidrag (refereegranskat)abstract This paper considers the modeling and the analysis of the performance of lock-free concurrent search data structures. Our analysis considers such lock-free data structures that are utilized through a sequence of operations which are generated with a memoryless and stationary access pattern. Our main contribution is a new way of analyzing lock-free concurrent search data structures: our execution model matches with the behavior that we observe in practice and achieves good throughput predictions. Search data structures are formed of basic blocks, usually referred to as nodes, which can be accessed by two kinds of events, characterized by their latencies; (i) CAS events originated as a result of modifications of the search data structure (ii) Read events that occur during traversals. An operation triggers a set of events, and the running time of an operation is computed as the sum of the latencies of these events. We identify the factors that impact the latency of such events on a multi-core shared memory system. The main challenge (though not the only one) is that the latency of each event mainly depends on the state of the caches at the time when it is triggered, and the state of caches is changing due to events that are triggered by the operations of any thread in the system. Accordingly, the latency of an event is determined by the ordering of the events on the timeline. Search data structures are usually designed to accommodate a large number of nodes, which makes the occurrence of an event on a given node rare at any given time. In this context, we model the events on each node as Poisson processes from which we can extract the frequency and probabilistic ordering of events that are used to estimate the expected latency of an operation, and in turn the throughput. We have validated our analysis on several fundamental lock-free search data structures such as linked lists, hash tables, skip lists and binary trees.
6.	Atalar, Aras, 1985, et al. (författare) Modeling and Analyzing the Performance of Lock-Free Data Structures 2014 Rapport (övrigt vetenskapligt/konstnärligt)abstract This paper considers the modeling and the analysis of the performance of lock-free concurrent datastructures. Lock-free designs employ an optimistic conflict control approach, allowing several processesto access the shared data object at the same time. The operations on these data structures are typicallydesigned as compositions of retry loops.Our main contribution is a new way of modeling and analyzing a general class of lock-free algorithms,achieving predictions of throughput that are close to what we observe in practice. In our model weintroduce two key metrics that shape the performance of lock-free algorithms: (i) expansion in executiontime of a retry due to memory congestion and (ii) number of wasted retries. We show how to computethese two metrics, and how to combine them, to calculate the throughput of an arguably large class oflock-free algorithms. Our analysis also captures the throughput performance of a lock-free algorithm whenexecuted as part of a larger parallel application. This part of our analysis leads to an analytical methodfor calculating a good back-off strategy to finely tune the performance of a lock-free application. Ourexperimental results, based on a set of widely used concurrent data structures and on abstract lock-freedesigns, show that our analysis follows closely the actual code behavior.To the best of our knowledge, this is the first attempt to make ends meet between theoretical boundson performance and actual measured throughput.
7.	Atalar, Aras, 1985, et al. (författare) Modeling Energy Consumption of Lock-Free Queue Implementations 2015 Ingår i: 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015, Hyderabad, India, 25-29 May. - : IEEE Computer Society. - 1530-2075. - 9781479986484 ; , s. 229-238 Konferensbidrag (refereegranskat)abstract This paper considers the problem of modeling the energy behavior of lock-free concurrent queue data structures. Our main contribution is a way to model the energy behavior of lock-free queue implementations and parallel applications that use them. Focusing on steady state behavior we decompose energy behavior into throughput and power dissipation which can be modeled separately and later recombined into several useful metrics, such as energy per operation. Based on our models, instantiated from synthetic benchmark data, and using only a small amount of additional application specific information, energy and throughput predictions can be made for parallel applications that use the respective data structure implementation. To model throughput we propose a generic model for lock-free queue throughput behavior, based on a combination of the dequeuers' throughput and enqueuers' throughput. To model power dissipation we commonly split the contributions from the various computer components into static, activation and dynamic parts, where only the dynamic part depends on the actual instructions being executed. To instantiate the models a synthetic benchmark explores each queue implementation over the dimensions of processor frequency and number of threads. Finally, we show how to make predictions of application throughput and power dissipation for a parallel application using a lock-free queue requiring only a limited amount of information about the application work done between queue operations. Our case study on a Mandelbrot application shows convincing prediction results.
8.	Bader, D.A., et al. (författare) ACM Journal on Experimental Algorithmics Special Issue on Multicore Algorithms 2012 Ingår i: Journal of Experimental Algorithmics. - : Association for Computing Machinery (ACM). - 1084-6654. ; 17, s. 4.1-4.1 Tidskriftsartikel (refereegranskat)abstract The recent switch to multicore processors brought a dramatic change that affects a large spectrum of systems from embedded and general-purpose to high-end computing systems. Parallelism is forcing major changes in software development. The aim of this issue is to discuss the challenges that parallelism brings to the design and implementation of algorithms and data structures. This special issue arose out of discussions held at the Dagstuhl Seminar 10261, on Algorithm Engineering held June 27–July 2, 2010, in Germany, and organized by Giuseppe F. Italiano (Università di Roma “Tor Vergata,” Italy), David S. Johnson (AT & T Research, Florham Park, NJ), Petra Mutzel (Technical University of Dortmund, Germany), and Peter Sanders (Karlsruhe Institute of Technology, Germany). We conceived a special issue of the ACM Journal on Experimental Algorithmics with a call for original submissions that address implementation and performance issues of multicore algorithms and data structures for any multicore processor, for example, Intel Nehalem, Single-Chip Cloud, NVIDIA and AMD GPUs. An experimental study typically includes an implementation, a series of experiments designed to understand the behavior of the algorithm(s) under study, and a critical discussion of the experiments and their results. We welcomed experimental submissions and encouraged authors to include test data from previously published studies to enable critical comparisons. A total of nine submissions were received, and four were accepted for this special issue. All manuscripts had at least three extensive reviews, and most received five to six reviews. We thank all of the authors for their submissions, and especially the 16 reviewers of these manuscripts.
9.	Benkner, Siegfried, et al. (författare) PEPPHER : Efficient and Productive Usage of Hybrid Computing Systems 2011 Ingår i: IEEE Micro. - : IEEE Institute of Electrical and Electronics. - 0272-1732 .- 1937-4143. ; 31:5, s. 28-41 Tidskriftsartikel (refereegranskat)abstract PEPPHER, a three-year European FP7 project, addresses efficient utilization of hybrid (heterogeneous) computer systems consisting of multicore CPUs with GPU-type accelerators. This article outlines the PEPPHER performance-aware component model, performance prediction means, runtime system, and other aspects of the project. A larger example demonstrates performance portability with the PEPPHER approach across hybrid systems with one to four GPUs.
10.	Benkner, S., et al. (författare) Peppher: Performance Portability and Programmability for Heterogeneous Many-Core Architectures 2017 Ingår i: Programming Multicore and Many-Core Computing Systems. - Hoboken, NJ, USA : John Wiley & Sons, Inc.. - 9781119332015 - 9780470936900 ; , s. 241-260 Bokkapitel (övrigt vetenskapligt/konstnärligt)abstract © 2017 by John Wiley & Sons, Inc. All rights reserved. PEPPHER takes a pluralistic and parallelization agnostic approach to programmability and performance portability for heterogeneous many-core architectures. The PEPPHER framework is in principle language independent but focuses on supporting C++ code with PEPPHER-specific annotations as pragmas or external annotations. The framework is open and extensible; the PEPPHER methodology details how new architectures are incorporated. The PEPPHER methodology consists of rules for how to extend the framework for new architectures. This mainly concerns adaptivity and autotuning for algorithm libraries, the necessary hooks and extensions for the run-time system and any supporting algorithms and data structures that this relies on. Offloading is a specific technique for programming heterogeneous platforms that can sometimes be applied with high efficiency. Offload as developed by the PEPPHER partner Codeplay is a particular, nonintrusive C++ extension allowing portable C++ code to support diverse heterogeneous multicore architectures in a single code base.
11.	Benkner, S., et al. (författare) The PEPPHER approach to programmability and performance portability for heterogeneous many-core architectures 2012 Ingår i: Advances in Parallel Computing. - : IOS Press. - 1879-808X .- 0927-5452. ; 22, s. 361-368, s. 361-368 Konferensbidrag (refereegranskat)abstract The European FP7 project PEPPHER is addressing programmability and performance portability for current and emerging heterogeneous many-core architectures. As its main idea, the project proposes a multi-level parallel execution model comprised of potentially parallelized components existing in variants suitable for different types of cores, memory configurations, input characteristics, optimization criteria, and couples this with dynamic and static resource and architecture aware scheduling mechanisms. Crucial to PEPPHER is that components can be made performance aware, allowing for more efficient dynamic and static scheduling on the concrete, available resources. The flexibility provided in the software model, combined with a customizable, heterogeneous, memory and topology aware run-time system is key to efficiently exploiting the resources of each concrete hardware configuration. The project takes a holistic approach, relying on existing paradigms, interfaces, and languages for the parallelization of components, and develops a prototype framework, a methodology for extending the framework, and guidelines for constructing performance portable software and systems-including paths to migration of existing software-for heterogeneous many-core processors. This paper gives a high-level project overview, and presents a specific example showing how the PEPPHER component variant model and resource-aware run-time system enable performance portability of a numerical kernel. © 2012 The authors and IOS Press. All rights reserved.
12.	Berger, Christian, 1980, et al. (författare) Bridging Physical and Digital Traffic System Simulations with the Gulliver Test-Bed 2013 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783642379734 ; 7865 LNCS, s. 169-184 Konferensbidrag (refereegranskat)abstract We propose a cyber-physical platform that combines road traffic simulation, network simulation, and physically simulated vehicles to facilitate extensive testing on various levels of vehicular systems. Our design integrates physical and digital vehicle simulation into a common development and testing environment. This paper describes the platform design and presents prototypical implementations that use Simulator of Urban Mobility (SUMO), TinyOS Simulator (TOSSIM), a 3D sensor simulation environment, and a test-bed of miniature vehicles called Gulliver. As a prototypical implementation, we demonstrate the development of cooperative applications, and by that we achieve: (a) a cyber-physical system that provides a common environment for physically and digitally simulated vehicles, (b) a platform to interface communication between physically and digitally simulated vehicles, and (c) the ability to tailor testing scenarios in which some system components are simulated digitally and some physically. The suggested design provides flexibility, cost efficiency, and scalable testing opportunities for future vehicular systems. Furthermore, the proposed system is able to support novel steps towards intelligent transportation systems for smart cities. © 2013 Springer-Verlag.
13.	Berger, Christian, 1980, et al. (författare) On Using the Gulliver Platform for Bridging Physical and Digital Traffic System Simulations 2013 Ingår i: Nationell Konferens i Transportforskning, Chalmers Göteborg. Konferensbidrag (övrigt vetenskapligt/konstnärligt)
14.	Bäckström, Karl, 1994, et al. (författare) ASAP.SGD: Instance-based Adaptiveness to Staleness in Asynchronous SGD 2022 Ingår i: Proceedings of Machine Learning Research. - 2640-3498. ; PMLR 162, s. 1261-1271 Konferensbidrag (refereegranskat)abstract Concurrent algorithmic implementations of Stochastic Gradient Descent (SGD) give rise to critical questions for compute-intensive Machine Learning (ML). Asynchrony implies speedup in some contexts, and challenges in others, as stale updates may lead to slower, or non-converging executions. While previous works showed asynchrony-adaptiveness can improve stability and speedup by reducing the step size for stale updates according to static rules, there is no one-size-fits-all adaptation rule, since the optimal strategy depends on several factors. We introduce (i) ASAP.SGD, an analytical framework capturing necessary and desired properties of staleness-adaptive step size functions and (ii) TAIL-T, a method for utilizing key properties of the execution instance, generating a tailored strategy that not only dampens the impact of stale updates, but also leverages fresh ones. We recover convergence bounds for adaptiveness functions satisfying the ASAP.SGD conditions, for general, convex and non-convex problems, and establish novel bounds for ones satisfying the Polyak-Lojasiewicz property. We evaluate TAIL-T with representative AsyncSGD concurrent algorithms, for Deep Learning problems, showing TAIL-T is a vital complement to AsyncSGD, with (i) persistent speedup in wall-clock convergence time in the parallelism spectrum, (ii) considerably lower risk of non-convergence, as well as (iii) precision levels for which original SGD implementations fail.
15.	Bäckström, Karl, 1994, et al. (författare) Consistent lock-free parallel stochastic gradient descent for fast and stable convergence 2021 Ingår i: Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021. ; , s. 423-432 Konferensbidrag (refereegranskat)abstract Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms, e.g. HOGWILD!, have received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Despite that they induce staleness and inconsistency, they have shown speedup for problems satisfying smooth, strongly convex targets, and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is however a gap in current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We contribute with answering questions in this gap by studying a spectrum of parallel algorithmic implementations ofAsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization in SGD convergence, and in sensitivity to hyper-parameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management, allocating space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for two prominent deep learning (DL) applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN). We observe the crucial impact of contention, staleness and consistency and show how, thanks to the aforementioned properties, Leashed-SGD provides significant improvements in stability as well as wall-clock time to convergence (from 20-80% up to 4 ×improvements) compared to the standard lock-based AsyncSGD algorithm and HOGWILD!, while reducing the overall memory footprint.
16.	Bäckström, Karl, 1994, et al. (författare) MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent 2019 Ingår i: Proceedings - 2019 IEEE International Conference on Big Data, Big Data 2019. ; , s. 16-25 Konferensbidrag (refereegranskat)abstract Stochastic Gradient Descent (SGD) is very useful in optimization problems with high-dimensional non-convex target functions, and hence constitutes an important component of several Machine Learning and Data Analytics methods. Recently there have been significant works on understanding the parallelism inherent to SGD, and its convergence properties. Asynchronous, parallel SGD (AsyncPSGD) has received particular attention, due to observed performance benefits. On the other hand, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence, stemming from the fact that the contribution of a thread might be based on an old (stale) view of the state. In this work we aim to deepen the understanding of AsyncPSGD in order to increase the statistical efficiency in the presence of stale gradients. We propose new models for capturing the nature of the staleness distribution in a practical setting. Using the proposed models, we derive a staleness-adaptive SGD framework, MindTheStep-AsyncPSGD, for adapting the step size in an online-fashion, which provably reduces the negative impact of asynchrony. Moreover, we provide general convergence time bounds for a wide class of staleness-adaptive step size strategies for convex target functions. We also provide a detailed empirical study, showing how our approach implies faster convergence for deep learning applications.
17.	Bäckström, Karl, 1994, et al. (författare) The Impact of Synchronization in Parallel Stochastic Gradient Descent 2022 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Cham : Springer International Publishing. - 1611-3349 .- 0302-9743. ; 13145 LNCS, s. 60-75 Konferensbidrag (refereegranskat)abstract In this paper, we discuss our and related work in the domain of efficient parallel optimization, using Stochastic Gradient Descent, for fast and stable convergence in prominent machine learning applications. We outline the results in the context of aspects and challenges regarding synchronization, consistency, staleness and parallel-aware adaptiveness, focusing on the impact on the overall convergence.
18.	Casado, Lander, 1985, et al. (författare) ContikiSec: A Secure Network Layer for Wireless Sensor Networks under the Contiki Operating System 2009 Ingår i: Proceedings of the 14th Nordic Conference on Secure IT Systems (NordSec 2009), Lecture Notes in Computer Science. - 1611-3349. - 9783642047657 ; 5838, s. 133 - 147 Konferensbidrag (refereegranskat)abstract In this paper we introduce ContikiSec, a secure network layer forwireless sensor networks, designed for the Contiki Operating System. ContikiSechas a configurable design, providing three security modes starting fromconfidentiality and integrity, and expanding to confidentiality, authentication,and integrity. ContikiSec has been designed to balance low energy consumptionand security while conforming to a small memory footprint. Our design wasbased on performance evaluation of existing security primitives and is part ofthe contribution of this paper. Our evaluation was performed in the ModularSensor Board hardware platform for wireless sensor networks, running Contiki.Contiki is an open source, highly portable operating system for wireless sensornetworks (WSN) that is widely used in WSNs.
19.	Casimiro, Antonio, et al. (författare) KARYON: Towards Safety Kernels for Cooperative Vehicular Systems 2012 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783642335358 ; 7596, s. 232-235 Konferensbidrag (refereegranskat)abstract KARYON, a kernel-based architecture for safety-critical control, is a European project that proposes a new perspective to improve performance of smart vehicle coordination focusing on Advanced Driver Assistance Systems (ADASs) and Unmanned Aerial Systems (UAS). The key objective is to provide system solutions for predictable and safe coordination of smart vehicles that autonomously cooperate and interact in an open and inherently uncertain environment. Currently, these systems are not allowed to operate on the public roads or in the air space, as the risk of causing severe damage cannot be excluded with sufficient certainty. The impact of the project is two-fold; it will provide improved vehicle density without driver involvement and increased traffic throughput to maintain mobility without a need to build new traffic infrastructures. The results will improve interaction in cooperation scenarios while preserving safety and assessing it according to standards. The prospective project results include self-stabilizing algorithms for vehicle coordination, communication and synchronization. In addition, we aim at showing that the safety kernel can be designed to be a self-stabilizing one.
20.	Cederman, Daniel, 1981, et al. (författare) A Practical Quicksort Algorithm for Graphics Processors 2008 Rapport (övrigt vetenskapligt/konstnärligt)abstract In this paper we describe GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that in CUDA, NVIDIA's programming platform for general purpose computations on graphical processors, GPU-Quicksort performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
21.	Cederman, Daniel, 1981, et al. (författare) A Practical Quicksort Algorithm for Graphics Processors 2008 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. ; 5193, s. 246-258 Konferensbidrag (refereegranskat)abstract In this paper we present GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered as an inefficient sorting solution for graphics processors, but we show that GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
22.	Cederman, Daniel, 1981, et al. (författare) A Study of the Behavior of Synchronization Methods in Commonly Used Languages and Systems 2013 Ingår i: Proceedings of the 27th IEEE International Parallel & Distributed Processing Symposium. - 1530-2075. - 9780769549712 ; , s. 1309-1320 Konferensbidrag (refereegranskat)abstract Synchronization is a central issue in concurrency and plays an important role in the behavior and performance of modern programmes. Programming languages and hardware designers are trying to provide synchronization constructs and primitives that can handle concurrency and synchronization issues efficiently. Programmers have to find a way to select the most appropriate constructs and primitives in order to gain the desired behavior and performance under concurrency. Several parameters and factors affect the choice,through complex interactions among (i) the language and the language constructs that itsupports, (ii) the system architecture, (iii) possible run-time environments,virtual machine options and memory management support and(iv) applications.We present a systematic study ofsynchronizationstrategies, focusing on concurrent data structures.We have chosen concurrent data structures with different number of contention spots.We consider both coarse-grain and fine-grain locking strategies, as well as lock-free methods.We have investigated synchronization-aware implementations in C++, C# (.NET and Mono) and Java.Considering the machine architectures, we have studied the behavior of theimplementations on both Intel's Nehalem and AMD's Bulldozer.The properties that we study are throughput and fairness under different workloads and multiprogramming execution environments.For NUMA architectures fairness is becoming as important as the typically considered throughput property.To the best of our knowledge this is the first systematic and comprehensive study of synchronization-aware implementations.This paper takes steps towards capturing a number of guidingprinciples and concerns for the selection of the programming environmentand synchronization methods in connection to the application and the system characteristics.
23.	Cederman, Daniel, 1981, et al. (författare) Adapting Lock-Free Concurrent Data Objects to Support a Generic Move Operation 2012 Rapport (övrigt vetenskapligt/konstnärligt)abstract In the paper Supporting Lock-Free Composition of Concurrent Data Objects we introduced a methodology to compose the insert and remove operations of lock-free data structures. This allowed for the creation of move operations that can atomically transfer data from one data structure to another. In this report we apply the methodology to four different types of data structures; a queue, a stack, a hash-table and a skip-list. We first show that the data structures are compatible with the methodology. We then go through the changes needed to adapt them. Code listings are provided that presents the algorithms before and after modification.
24.	Cederman, Daniel, 1981, et al. (författare) Brief announcement: Concurrent data structures for efficient streaming aggregation 2014 Ingår i: Annual ACM Symposium on Parallelism in Algorithms and Architectures. - New York, NY, USA : ACM. - 9781450328210 ; , s. 76-78 Konferensbidrag (refereegranskat)abstract We briefly describe our study on the problem of streaming multiway aggregation [5], where large data volumes are received from multiple input streams. Multiway aggregation is a fundamental computational component in data stream management systems, requiring low-latency and high throughput solutions. We focus on the problem of designing concurrent data structures enabling for low-latency and highthroughput multiway aggregation; an issue that has been overlooked in the literature. We propose two new concurrent data structures and their lock-free linearizable implementations, supporting both order-sensitive and order-insensitive aggregate functions. Results from an extensive evaluation show significant improvement in the aggregation performance, in terms of both processing throughput and latency over the commonly-used techniques based on queues.
25.	Cederman, Daniel, 1981, et al. (författare) Concurrent Data Structures for Efficient Streaming Aggregation 2013 Rapport (övrigt vetenskapligt/konstnärligt)abstract In many data gathering applications, information arrives in the form of continuous streams rather than finite data sets.Efficient one-pass algorithms are required to cope with high input loads.Stream processing engines support continuous queries to process data in a real-time fashion and have evolved rapidly from centralized to distributed, parallel and elastic solutions.While a big effort has been put on leveraging the processing capacity of clusters of machines, less work has focused on leveraging the parallelism enabled by multi-core architectures by means of concurrent and lock-free data structures, to support the pipeline.This paper explores this aspect focusing on multiway aggregation, where large data volumes are received from multiple input streams.Multiway aggregation is crucial in contexts such as sensor networks, social media or clickstream analysis applications.We provide three enhanced aggregate operators that rely on two new concurrent data structures and their lock-free implementations, supporting both order-sensitive and order-insensitive aggregation functions.We provide an extensive study of the properties of the proposed aggregate operators and the new data structures.We also show an extensive experimental evaluation of the proposed methods, giving empirical evidence of their superiority.In this evaluation we run a variety of aggregation queries on two large datasets, one with data extracted from SoundCloud, a music social network, and one with data from a smart grid metering network.In all the experiments, the new data structures improved the aggregation performance significantly, up to one order of magnitude, in terms of both processing throughput and latency.
26.	Cederman, Daniel, 1981, et al. (författare) Dynamic Load Balancing using Work-Stealing 2012 Ingår i: GPU Computing Gems Jade Edition. ; , s. 485-499 Bokkapitel (övrigt vetenskapligt/konstnärligt)abstract In this chapter, we present a methodology for efficient load balancing of computational problems thatcan be easily decomposed into multiple tasks, but where it is hard to predict the computation cost ofeach task, and where new tasks are created dynamically during runtime. We present this methodologyand its exploitation and feasibility in the context of graphics processors. Work-stealing allows an idlecore to acquire tasks from a core that is overloaded, causing the total work to be distributed evenlyamong cores, while minimizing the communication costs, as tasks are only redistributed when required.This will often lead to higher throughput than using static partitioning.
27.	Cederman, Daniel, 1981, et al. (författare) GPU-Quicksort: A practical Quicksort algorithm for graphics processors 2009 Ingår i: Journal of Experimental Algorithmics. - 1084-6654. ; 14:4 Tidskriftsartikel (refereegranskat)abstract In this article, we describe GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multicore graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that in CUDA, NVIDIA's programing platform for general-purpose computations on graphical processors, GPU-Quicksort performs better than the fastest-known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
28.	Cederman, Daniel, 1981, et al. (författare) Lock-Free Concurrent Data Structures 2017 Ingår i: Programming multi-core and many-core computing systems. - Hoboken, NJ, USA : John Wiley & Sons, Inc.. - 9781119332015 ; , s. 29-58 Bokkapitel (övrigt vetenskapligt/konstnärligt)abstract © 2017 by John Wiley & Sons, Inc. All rights reserved. Concurrent data structures are the data sharing side of parallel programming. An implementation of a data structure is called lock-free, if it allows multiple processes/hreads to access the data structure concurrently and also guarantees that at least one operation among those finishes in a finite number of its own steps regardless of the state of the other operations. This chapter provides a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. It offers the programmer familiarity to the subject that allows using truly concurrent methods. The chapter discusses the fundamental synchronization primitives on which efficient lock-free data structures rely. It discusses the problem of managing dynamically allocated memory in lock-free concurrent data structures and general concurrent environments. The idiosyncratic architectural features of graphics processors that is important to consider when designing efficient lock-free concurrent data structures for this emerging area.
29.	Cederman, Daniel, 1981, et al. (författare) Lock-free Concurrent Data Structures 2013 Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)abstract Concurrent data structures are the data sharing side of parallel programming. Data structures give the means to the program to store data, but also provide operations to the program to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes more crucial because of the increased use of data and resource sharing for utilizing parallelism. The first and main goal of this chapter is to provide a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. The second goal is to offer the programmer familiarity to the subject that will allow her to use truly concurrent methods.
30.	Cederman, Daniel, 1981, et al. (författare) On Dynamic Load Balancing on Graphics Processors 2008 Ingår i: Proceedings of the 23rd SIGGRAPH/Eurographics Conference on Graphics Hardware. - 1727-3471. - 9783905674095 ; 2008, s. 57-64 Konferensbidrag (refereegranskat)abstract To get maximum performance on the many-core graphics processorsit is important to have an even balance of the workload so thatall processing units contribute equally to the task at hand.This can be hard to achieve when the cost of a task is notknown beforehand and when new sub-tasks are created dynamicallyduring execution. With the recent advent of scatter operationsand atomic hardware primitives it is now possible to bring someof the more elaborate dynamic load balancing schemes from theconventional SMP systems domain to the graphics processordomain.We have compared four different dynamic load balancing methodsto see which one is most suited to the highly parallel world ofgraphics processors. Three of these methods were lock-free andone was lock-based. We evaluated them on the task of creatingan octree partitioning of a set of particles. The experimentsshowed that synchronization can be very expensive and that newmethods that take more advantage of the graphics processorsfeatures and capabilities might be required. They also showedthat lock-free methods achieves better performance thanblocking and that they can be made to scale with increasednumbers of processing units.
31.	Cederman, Daniel, 1981, et al. (författare) On Sorting and Load-Balancing on GPUs 2008 Ingår i: ACM SIGARCH Computer Architecture News. ; 36:5 Tidskriftsartikel (refereegranskat)abstract In this paper we present GPU-Quicksort, an efficientQuicksort algorithm suitable for highly parallel multi-coregraphics processors. Quicksort has previously been consideredas an inefficient sorting solution for graphics processors,but we show that GPU-Quicksort often performs betterthan the fastest known sorting implementations for graphicsprocessors, such as radix and bitonic sort. Quicksortcan thus be seen as a viable alternative for sorting largequantities of data on graphics processors.We also present a comparison of different load balancingschemes. To get maximum performance on the manycoregraphics processors it is important to have an evenbalance of the workload so that all processing units contributeequally to the task at hand. This can be hard toachieve when the cost of a task is not known beforehandand when new sub-tasks are created dynamically during execution.With the recent advent of scatter operations andatomic hardware primitives it is now possible to bring someof the more elaborate dynamic load balancing schemes fromthe conventional SMP systems domain to the graphics processordomain.
32.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects 2010 Ingår i: SIGPLAN Notices (ACM Special Interest Group on Programming Languages). - : Association for Computing Machinery (ACM). - 0730-8566 .- 0362-1340 .- 1558-1160. ; 45:5, s. 339-340 Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)abstract Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks and convoying and, more importantly, being highly concurrent. But they share a common disadvantage in that the operations they provide are difficult to compose into larger atomic operations while still guaranteeing lock-freedom. We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology.
33.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects 2010 Ingår i: Proceedings of the 7th ACM conference on Computing frontiers. - 9781450300445 ; , s. 53-62 Konferensbidrag (refereegranskat)abstract Lock-free data objects offer several advantages over their blocking counterparts, such as beingimmune to deadlocks and convoying and, more importantly, being highly concurrent.However, composing the operations they provide into larger atomic operations, while still guaranteeingefficiency and lock-freedom, is a challenging algorithmic task.We present a lock-freemethodology for composing highly concurrent linearizable objects together by unifying their linearization points.This makes it possible to relatively easily introduce atomic lock-free move operations to a wide rangeof concurrent objects. Experimental evaluation has shown that the operations originally supported bythe data objects keep their performance behavior under our methodology.
34.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects 2009 Rapport (övrigt vetenskapligt/konstnärligt)abstract Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks and convoying and, more importantly, being highly concurrent. But they share a common disadvantage in that the operations they provide are difficult to composeinto larger atomic operations while still guaranteeing lock-freedom. We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology.
35.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects 2010 Ingår i: Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming. - New York, NY, USA : ACM. - 9781605587080 ; , s. 339-340 Konferensbidrag (refereegranskat)abstract Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks and convoying and, more importantly, being highly concurrent. But they share a common disadvantage in that the operations they provide are difficult to compose into larger atomic operations while still guaranteeing lock-freedom. We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology.
36.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects 2010 Ingår i: MCC10 Proceedings. Konferensbidrag (refereegranskat)abstract We present a lock-free methodology for composing highly concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent objects. Experimental evaluation has shown that the operations originally supported by the data objects keep their performance behavior under our methodology.
37.	Cederman, Daniel, 1981, et al. (författare) Supporting Lock-Free Composition of Concurrent Data Objects: Moving Data Between Containers 2013 Ingår i: IEEE Transactions on Computers. - 0018-9340. ; 62:9, s. 1866-1878 Tidskriftsartikel (refereegranskat)abstract Lock-free data objects offer several advantages over their blocking counterparts, such as being immune to deadlocks, priority inversion and convoying. They have also been shown to work well in practice. However, composing the operations they provide into larger atomic operations, while still guaranteeing efficiency and lock-freedom, is a challenging algorithmic task. We present a lock-free methodology for composing a wide variety of concurrent linearizable objects together by unifying their linearization points. This makes it possible to relatively easily introduce atomic lock-free move operations to a wide range of concurrent lock-free containers. This move operation allows data to be transferred from one container to another, in a lock-free way, without blocking any of the operations supported by the original container. For a data object to be suitable for composition using our methodology it needs to fulfil a set of requirements. These requirement are however generic enough to be fulfilled by a large set of objects. To show this we have performed case studies on six commonly used lock-free objects (a stack, a queue, a skip-list, a deque, a doubly linked-list and a hash-table) to demonstrate the general applicability of the methodology. We also show that the operations originally supported by the data objects keep their performance behavior under our methodology.
38.	Cederman, Daniel, 1981, et al. (författare) Towards a Software Transactional Memory for CUDA 2009 Ingår i: MCC09 Proceedings. Konferensbidrag (refereegranskat)abstract The introduction of CUDA, NVIDIA's system for general purpose computing on their many-core graphics processorsystem, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization.Software transactional memory (STM) simplifies development of concurrent code by allowing theprogrammer to mark sections of code to be executed atomically. The STM will then guarantee that otherprocesses will see either none or all of the writes done in in that section. In contrast to using locks,STM:s are easy to compose and does not suffer from deadlocks. An STM can thus be seen as a concurrency control mechanism.In this paper we report on our work towards implementing a simple software transactional memory in CUDA.
39.	Cederman, Daniel, 1981, et al. (författare) Towards a Software Transactional Memory for Graphics Processors 2010 Ingår i: Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization 2010. Konferensbidrag (refereegranskat)abstract The introduction of general purpose computing on many-core graphics processorsystems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization.Software transactional memory (STM) simplifies development of concurrent code by allowing theprogrammer to mark sections of code to be executed concurrently and atomically in an optimistic manner.In contrast to locks,STMs are easy to compose and do not suffer from deadlocks.We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking.The design issues involved in the designing of these two STMs are described andexplained in the paper together with experimental results comparing the performance of the two STMs.
40.	Cederman, Daniel, 1981, et al. (författare) Understanding the Performance of Concurrent Data Structures on Graphics Processors 2012 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783642328190 ; 7484/2012, s. 883-894 Konferensbidrag (refereegranskat)abstract In this paper we revisit the design of concurrent data structures -- specifically queues -- and examine their performance portabilitywith regard to the move from conventional CPUs to graphics processors. We have looked at both lock-based and lock-free algorithmsand have, for comparison, implemented and optimized the same algorithms on both graphics processors and multi-core CPUs.Particular interest has been paid to study the difference between the old Tesla and the new Fermi and Kepler architecturesin this context.We provide a comprehensive evaluation and analysis of our implementations on all examined platforms.Our results indicate that the queues are in general performance portable, but that platform specific optimizations are possibleto increase performance. The Fermi and Kepler GPUs, with optimized atomic operations, are observed to provide excellent scalabilityfor both lock-based and lock-free queues.
41.	Chatterjee, Bapi, 1982, et al. (författare) Concurrent Linearizable Nearest Neighbour Search in LockFree-kD-tree 2015 Rapport (övrigt vetenskapligt/konstnärligt)abstract The Nearest neighbour search (NNS) is an important problem in a large number of application domains dealing with multidimensional data. In concurrent settings, where dynamic modi?cations are allowed, a linearizable implementation of NNS is highly desirable to discover the latest nearest neighbour of a given target data-point. In this paper, we introduce the LockFree-kD-tree (LFkD-tree): a lock-free concurrent kD-tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-tree use single-word read and compare-and-swap (CAS) atomic primitives, which are readily supported on commonly available multi-core processors. We experimentally evaluate the LFkD-tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves significant speed-up compared to the implementations of an existing sequential kD-tree and a recently proposed multidimensional indexingstructure, PH-tree.
42.	Chatterjee, Bapi, 1982, et al. (författare) Concurrent linearizable nearest neighbour search in lockfree-kd-Tree 2018 Ingår i: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. ; Part F133180 Konferensbidrag (refereegranskat)abstract The Nearest neighbour search (NNS) is a fundamental problem in many application domains dealing with multidimensional data. In a concurrent setting, where dynamic modi-fications are allowed, a linearizable implementation of NNS is highly desirable. This paper introduces the LockFree-kD-Tree (LFkD-Tree): A lock-free concurrent kD-Tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-Tree use single-word read and compare-And-swap (CAS) atomic primitives, which are readily supported on available multi-core processors. We experimentally evaluate the LFkD-Tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves signi cant speed-up compared to the implementations of an existing sequential kD-Tree and a recently proposed multidimensional indexing structure, PH-Tree. © 2018 Copyright held by the owner/author(s).
43.	Chatterjee, Bapi, 1982, et al. (författare) Concurrent linearizable nearest neighbour search in LockFree-kD-tree 2021 Ingår i: Theoretical Computer Science. - : Elsevier BV. - 0304-3975. ; 886, s. 27-48 Tidskriftsartikel (refereegranskat)abstract The Nearest neighbour search (NNS) is a fundamental problem in many application domains dealing with multidimensional data. In a concurrent setting, where dynamic modifications are allowed, a linearizable implementation of the NNS is highly desirable. This paper introduces the LockFree-kD-tree (LFkD-tree ): a lock-free concurrent kD-tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-tree use single-word read and compare-and-swap ([Formula presented] ) atomic primitives, which are readily supported on available multi-core processors. We experimentally evaluate the LFkD-tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves significant speed-up compared to the implementations of an existing sequential kD-tree and a recently proposed multidimensional indexing structure, PH-tree.
44.	Chatterjee, Bapi, 1982, et al. (författare) Efficient lock-free binary search trees 2014 Ingår i: 2014 ACM Symposium on Principles of Distributed Computing, PODC 2014; Paris; France; 15 July 2014 through 18 July 2014. - New York, NY, USA : ACM. - 9781450329446 ; , s. 322-331 Konferensbidrag (refereegranskat)abstract In this paper we present a novel algorithm for concurrent lock-free internal binary search trees (BST) and implement a Set abstract data type (ADT) based on that. We show that in the presented lock-free BST algorithm the amortized step complexity of each set operation - ADD, REMOVE and CONTAINS - is O(H(n) + c), where H(n) is the height of the BST with n number of nodes and c is the contention during the execution. Our algorithm adapts to contention measures according to read-write load. If the situation is read-heavy, the operations avoid helping the concurrent REMOVE operations during traversal, and adapt to interval contention. However, for the write-heavy situations we let an operation help a concurrent REMOVE, even though it is not obstructed. In that case, an operation adapts to point contention. It uses single-word compare-and-swap (CAS) operations. We show that our algorithm has improved disjoint-access-parallelism compared to similar existing algorithms. We prove that the presented algorithm is linearizable. To the best of our knowledge, this is the first algorithm for any concurrent tree datastructure in which the modify operations are performed with an additive term of contention measure.
45.	Chatterjee, Bapi, 1982, et al. (författare) Help-optimal and Language-portable Lock-free Concurrent Data Structures 2016 Rapport (övrigt vetenskapligt/konstnärligt)abstract Helping is the most common mechanism to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operationstep is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable.Additionally, we do not use a language/platform speci?c mechanism to modulate helping, speci?cally, we use neither bit-stealing from a pointer nor runtime type introspection of objects, making the algorithms language-portable. Further, to optimize the amortized numberof steps per operation, if a CAS execution to modify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch.We use several micro-benchmarks in both C/C++ and Java to validate the e?ciency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimized ones and in manycases yield 10%-50% higher throughput.
46.	Chatterjee, Bapi, 1982, et al. (författare) Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures 2016 Ingår i: 45th International Conference on Parallel Processing (ICPP), 2016. - 0190-3918. - 9781509028238 ; 2016 september, s. 360-369 Konferensbidrag (refereegranskat)abstract Helping is a widely used technique to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. We design the algorithms without using any language/platformspecific mechanism. Specifically, we use neither bit-stealing froma pointer nor runtime type introspection of objects. Thus, our algorithms are language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution tomodify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimizedones and in many cases yield 10%-50% higher throughput.
47.	Damaschke, Peter, 1963, et al. (författare) Competitive Freshness Algorithms for Wait-free Data Objects 2006 Ingår i: Proceedings of the 12th European Conference on Parallel Computing (EURO-PAR '06), LNCS, Springer Verlag. - 3540377832 ; 4128, s. 811-820 Tidskriftsartikel (refereegranskat)
48.	Damaschke, Peter, 1963, et al. (författare) Competitive Freshness Algorithms for Wait-free Data Objects 2005 Rapport (övrigt vetenskapligt/konstnärligt)abstract Wait-free concurrent data objects are widely used in multiprocessor systems and real-time systems. Their popularity results from the fact that they avoid locking and that concurrent operations on such data objects are guaranteed to finish in a bounded number of steps regardless of the other operations interference. The data objects allow high access parallelism and guarantee correctness of the concurrent access with respect to its semantics. In such a highly-concurrent environment, where write-operations update the object state concurrently with read-operations, the age/freshness of the state returned by a read-operation is a significant measure of the object quality.In this paper, we first propose a freshness measure for wait-free concurrent data objects. Subsequently, we model the freshness problem as an online problem and present two algorithms for the problem. The first one is an optimal deterministic algorithm with freshness competitive ratio $\sqrt{\alpha}$, where $\alpha$ is a function of execution-time upper-bound of wait-free operations. The second one is a competitive randomized algorithm with freshness competitive ratio $\frac{\ln \alpha}{1 + \ln 2 - \frac{2}{\sqrt{\alpha}}}$.
49.	Damaschke, Peter, 1963, et al. (författare) One-Way Trading with Time-Varying Exchange Rate Bounds 2005 Rapport (övrigt vetenskapligt/konstnärligt)abstract One-way trading is a basic online problem in finance. Since its optimal solution is given by a simple formula (however with difficult analysis), the problem is attractive as a target to which other practical online problems can betransformed. However, there are still natural online problems that do not fit in the known variants of one-way trading. We present some new models where the bounds of exchange rates are not constant but vary with time in certain ways. The first model, where the (logarithmic) exchange rate has limited decay speed, arises from an issue in distributed data structures, namely to maximize the(appropriately defined) freshness of values in concurrent objects. For this model we prove a lower bound on the competitive ratio which is nearly optimal, i.e., upto a small constant factor. Clearly, the lower bound holds also against stronger adversaries. Then, we give an optimal algorithm in a model where only the maximal allowed exchange rate decreases with time, but the actual exchange rates may jump arbitrarily within this frame. We have chosen this more powerful adversary model afterwards because some applications does not make use of the limited decay speed. In fact, we adapt the optimal threat-based algorithm of El-Yaniv et al. (2001) to our case. Our numerical experiments suggest that this algorithm is still not too far from the lower bound for the weaker adversary. This isexplained by the observation that slowly increasing exchange rates seem to be the worst case for the online player.
50.	Damaschke, Peter, 1963, et al. (författare) Online search with time-varying price bounds 2009 Ingår i: Algorithmica. - : Springer Science and Business Media LLC. - 0178-4617 .- 1432-0541. ; 55:4, s. 619-642 Tidskriftsartikel (refereegranskat)abstract Online search is a basic online problem. The fact that its optimal deterministic/randomized solutions are given by simple formulas (however with difficult analysis) makes the problem attractive as a target to which other practical online problems can be transformed to find optimal solutions. However, since the upper/lower bounds of prices in available models are constant, natural online problems in which these bounds vary with time do not fit in the available models. We present two new models where the bounds of prices are not constant but vary with time in certain ways. The first model, where the upper and lower bounds of (logarithmic) prices have decay speed, arises from a problem in concurrent data structures, namely to maximize the (appropriately defined) freshness of data in concurrent objects. For this model we present an optimal deterministic algorithm and a nearly-optimal randomized algorithm. We also prove a lower bound of competitive ratios of randomized algorithms. The second model is inspired by the fact that some applications do not utilize the decay speed of the lower bound of prices in the first model. In the second model, only the upper bound decreases arbitrarily with time and the lower bound is constant. Clearly, the lower bound of competitive ratios proved for the first model holds also against the stronger adversary in the second model. For the second model, we present an optimal randomized algorithm. Our numerical experiments on the freshness problem show that this new algorithm achieves much better/smaller competitive ratios than previous algorithms do.

Skapa referenser, mejla, bekava och länka

Länka till träfflistan

Träfflista för sökning "WFRF:(Tsigas Philippas 1967) "

Avgränsa träffmängd

År