SwePub - sökning: WFRF:(Ha Phuong 1976)

Numrering	Referens	Omslagsbild	Hitta
1.	Cederman, Daniel, 1981, et al. (författare) Lock-free Concurrent Data Structures 2013 Tidskriftsartikel (övrigt vetenskapligt/konstnärligt)abstract Concurrent data structures are the data sharing side of parallel programming. Data structures give the means to the program to store data, but also provide operations to the program to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes more crucial because of the increased use of data and resource sharing for utilizing parallelism. The first and main goal of this chapter is to provide a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. The second goal is to offer the programmer familiarity to the subject that will allow her to use truly concurrent methods.
2.	Damaschke, Peter, 1963, et al. (författare) Competitive Freshness Algorithms for Wait-free Data Objects 2006 Ingår i: Proceedings of the 12th European Conference on Parallel Computing (EURO-PAR '06), LNCS, Springer Verlag. - 3540377832 ; 4128, s. 811-820 Tidskriftsartikel (refereegranskat)
3.	Damaschke, Peter, 1963, et al. (författare) Competitive Freshness Algorithms for Wait-free Data Objects 2005 Rapport (övrigt vetenskapligt/konstnärligt)abstract Wait-free concurrent data objects are widely used in multiprocessor systems and real-time systems. Their popularity results from the fact that they avoid locking and that concurrent operations on such data objects are guaranteed to finish in a bounded number of steps regardless of the other operations interference. The data objects allow high access parallelism and guarantee correctness of the concurrent access with respect to its semantics. In such a highly-concurrent environment, where write-operations update the object state concurrently with read-operations, the age/freshness of the state returned by a read-operation is a significant measure of the object quality.In this paper, we first propose a freshness measure for wait-free concurrent data objects. Subsequently, we model the freshness problem as an online problem and present two algorithms for the problem. The first one is an optimal deterministic algorithm with freshness competitive ratio $\sqrt{\alpha}$, where $\alpha$ is a function of execution-time upper-bound of wait-free operations. The second one is a competitive randomized algorithm with freshness competitive ratio $\frac{\ln \alpha}{1 + \ln 2 - \frac{2}{\sqrt{\alpha}}}$.
4.	Damaschke, Peter, 1963, et al. (författare) One-Way Trading with Time-Varying Exchange Rate Bounds 2005 Rapport (övrigt vetenskapligt/konstnärligt)abstract One-way trading is a basic online problem in finance. Since its optimal solution is given by a simple formula (however with difficult analysis), the problem is attractive as a target to which other practical online problems can betransformed. However, there are still natural online problems that do not fit in the known variants of one-way trading. We present some new models where the bounds of exchange rates are not constant but vary with time in certain ways. The first model, where the (logarithmic) exchange rate has limited decay speed, arises from an issue in distributed data structures, namely to maximize the(appropriately defined) freshness of values in concurrent objects. For this model we prove a lower bound on the competitive ratio which is nearly optimal, i.e., upto a small constant factor. Clearly, the lower bound holds also against stronger adversaries. Then, we give an optimal algorithm in a model where only the maximal allowed exchange rate decreases with time, but the actual exchange rates may jump arbitrarily within this frame. We have chosen this more powerful adversary model afterwards because some applications does not make use of the limited decay speed. In fact, we adapt the optimal threat-based algorithm of El-Yaniv et al. (2001) to our case. Our numerical experiments suggest that this algorithm is still not too far from the lower bound for the weaker adversary. This isexplained by the observation that slowly increasing exchange rates seem to be the worst case for the online player.
5.	Damaschke, Peter, 1963, et al. (författare) Online search with time-varying price bounds 2009 Ingår i: Algorithmica. - : Springer Science and Business Media LLC. - 0178-4617 .- 1432-0541. ; 55:4, s. 619-642 Tidskriftsartikel (refereegranskat)abstract Online search is a basic online problem. The fact that its optimal deterministic/randomized solutions are given by simple formulas (however with difficult analysis) makes the problem attractive as a target to which other practical online problems can be transformed to find optimal solutions. However, since the upper/lower bounds of prices in available models are constant, natural online problems in which these bounds vary with time do not fit in the available models. We present two new models where the bounds of prices are not constant but vary with time in certain ways. The first model, where the upper and lower bounds of (logarithmic) prices have decay speed, arises from a problem in concurrent data structures, namely to maximize the (appropriately defined) freshness of data in concurrent objects. For this model we present an optimal deterministic algorithm and a nearly-optimal randomized algorithm. We also prove a lower bound of competitive ratios of randomized algorithms. The second model is inspired by the fact that some applications do not utilize the decay speed of the lower bound of prices in the first model. In the second model, only the upper bound decreases arbitrarily with time and the lower bound is constant. Clearly, the lower bound of competitive ratios proved for the first model holds also against the stronger adversary in the second model. For the second model, we present an optimal randomized algorithm. Our numerical experiments on the freshness problem show that this new algorithm achieves much better/smaller competitive ratios than previous algorithms do.
6.	Ha, Phuong, 1976, et al. (författare) Brief Announcement: Wait-free Programming for General Purpose Computations on Graphics Processors 2008 Ingår i: Proceedings of the twenty-seventh ACM symposium on Principles of distributed computing. - 9781595939890 ; , s. 452- Konferensbidrag (refereegranskat)abstract This paper aims at bridging the gap between the lack of synchronization mechanisms in recent graphics processor (GPU) architectures and the need of synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the new wait-free objects have time complexity O(N), where N is the number of concurrent processes. The wait-free objects have space complexity O(N2), which is optimal. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
7.	Ha, Phuong, 1976, et al. (författare) Efficient Multi-Word Locking Using Randomization 2005 Ingår i: Proceedings of the 24th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (PODC '05). ; , s. 249 - 257 Konferensbidrag (refereegranskat)
8.	Ha, Phuong, 1976, et al. (författare) Efficient Self-tuning Spin-locks Using Competitive Analysis 2007 Ingår i: Journal of Systems and Software. - 0164-1212. Tidskriftsartikel (refereegranskat)abstract Reactive spin-lock algorithms that can automatically adapt to contention variation on the lock have received great attention in the eld of multiprocessor synchronization since they can help applications achieve good performance in all possible contention conditions. However, in existing reactive spin-locks the reaction relies on (i) some fixed experimentally tuned thresholds, which may get frequently inappropriate in dynamic environments like multiprogramming/multiprocessor systems, or (ii) known probability distributions of inputs. This paper presents a new reactive spin-lock algorithm that is completely selftuning, which means no experimentally tuned parameter nor probability distribution of inputs are needed. The new spin-lock is based on both synchronization structures of applications and a competitive online algorithm. Our experiments, which use the Spark98 kernels and the SPLASH-2 applications as application benchmarks, on a multiprocessor machine SGI Origin2000 and an Intel Xeon workstation have showed that the new self-tuning spin-lock performs as well as the best of hand-tuning spin-lock representatives in a wide range of contention levels.
9.	Ha, Phuong, 1976, et al. (författare) Fast, reactive and lock-free multi-word compare-and-swap algorithms 2003 Rapport (övrigt vetenskapligt/konstnärligt)
10.	Ha, Phuong, 1976, et al. (författare) LYDIAN: User's Guide 2005 Rapport (övrigt vetenskapligt/konstnärligt)abstract One area in which visualization techniques may be applied to enhance understanding of computer systems is the field of distributed algorithms. Lydian is a simulation and visualization environment for distributed algorithms that provides to the students an experimental environment to test and visualize the behaviour of distributed algorithms. It gives to the students the ability to create easily their own experiments and visualize the behaviour of the algorithms on their experiments. Lydian is easy to use and can also be extended.
11.	Ha, Phuong, 1976, et al. (författare) NB-FEB: A Universal Scalable Easy-to-Use Synchronization Primitive for Manycore Architectures 2009 Ingår i: Proceedings of the 13th International Conference on Principle of Distributed Systems (OPODIS 2009), Lecture Notes in Computer Science. - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349. - 9783642108761 ; 5923, s. 189-203 Konferensbidrag (refereegranskat)abstract his paper addresses the problem of universal synchronization primitives that can support scalable thread synchronization for large-scale manycore architectures. The universal synchronization primitives that have been deployed widely in conventional architectures, are the compare-and-swap (CAS) and load-linked/store-conditional (LL/SC) primitives. However, such synchronization primitives are expected to reach their scalability limits in the evolution to manycore architectures with thousands of cores.We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on manycore architectures. We show that the NB-FEB primitive is universal, scalable, feasible and easy to use. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently makes NB-FEB scalable (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility). We construct, on top of NB-FEB, a non-blocking software transactional memory system called NBFEB-STM, which can be used as an abstraction to handle concurrent threads easily. NBFEB-STM is space efficient: the space complexity of each object updated by N concurrent threads/transactions is ${\it \Theta}(N)$, which is optimal.
12.	Ha, Phuong, 1976, et al. (författare) NB-FEB: An Easy-to-Use and Scalable Universal Synchronization Primitive for Parallel Programming 2008 Rapport (övrigt vetenskapligt/konstnärligt)abstract This paper addresses the problem of universal synchronizationprimitives that can support scalable thread synchronizationfor large-scale many-core architectures. The universalsynchronization primitives that have been deployed widelyin conventional architectures, are the compare-and-swap (CAS)and load-linked/store-conditional (LL/SC) primitives. However,such synchronization primitives are expected to reachtheir scalability limits in the evolution to many-core architectureswith thousands of cores.We introduce a non-blocking full/empty bit primitive, orNB-FEB for short, as a promising synchronization primitivefor parallel programming on may-core architectures. We showthat the NB-FEB primitive is universal, scalable, feasible andconvenient to use. NB-FEB, together with registers, can solvethe consensus problem for an arbitrary number of processes(universality). NB-FEB is combinable, namely its memory requeststo the same memory location can be combined intoonly one memory request, which consequently mitigates performancedegradation due to synchronization "hot spots" (scalability).Since NB-FEB is a variant of the original full/emptybit that always returns a value instead of waiting for a conditionalflag, it is as feasible as the original full/empty bit, whichhas been implemented in many computer systems (feasibility).The original full/empty bit is well-known as a special-purposeprimitive for fast producer-consumer synchronization and hasbeen used extensively in the specific domain of applications.In this paper, we show that NB-FEB can be deployed easilyas a general-purpose primitive. Using NB-FEB, we constructa non-blocking software transactional memory systemcalled NBFEB-STM, which can be used to handle concurrentthreads conveniently. NBFEB-STM is space efficient:the space complexity of each object updated by N concurrentthreads/transactions is Θ(N), the optimal.
13.	Ha, Phuong, 1976, et al. (författare) Non-blocking programming on multi-core graphics processors 2009 Ingår i: SIGARCH Computer Architecture News. - 0163-5964. ; 36:5, s. 19-28 Tidskriftsartikel (refereegranskat)abstract This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need of synchronization primitives other than reads and writes.Moreover, based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. Our result demonstrates that it is possible to construct waitfree synchronization mechanisms for GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
14.	Ha, Phuong, 1976, et al. (författare) Preliminary results on nb-feb, a synchronization primitive for parallel programming 2009 Ingår i: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming. - New York, NY, USA : ACM. - 9781605583976 ; , s. 295-296 Konferensbidrag (refereegranskat)abstract We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on may-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
15.	Ha, Phuong, 1976 (författare) Reactive Concurrent Data Structures and Algorithms for Synchronization 2006 Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract Parallelism plays a significant role in high-performance computing systems, from large clusters of computers to chip-multithreading (CMT) processors. Performance of the parallel systems comes not only from concurrently runningmore processing hardware but also from utilizing the hardware efficiently. The hardware utilization is strongly influenced by how processors/processes are synchronized in the system to maximize parallelism. Synchronization between concurrent processes usually relies on shared data structures. The data structures that enhance parallelism by allowing processes to access them concurrently are known as "concurrent data structures". The thesis aims at developing efficient concurrent data structures and algorithms for synchronization in asynchronous shared-memory multiprocessors.Generally speaking, simple data structures perform well in the absence of contention but perform poorly in high-contention situations. Contrarily, sophisticated data structures that can scale and perform well in the presence of high contention usually suffer unnecessary high latency when there is no contention. Efficient concurrent data structures should be able to adapt their algorithmic complexity to varying contention.This has motivated us to develop fundamental concurrent data structures like trees, multi-word compare-and-swap and locks into reactive ones that timely adapt their size or algorithmic behavior to the contention level in execution environments. While the contention is varying rapidly, the reactive data structures must keep the cost of reaction below its benefit, avoiding unnecessary reaction due to the contention oscillation. This is quite challenging since the information of how the contention will vary in the future is usually not available in multiprogramming multiprocessor environments. To deal with the uncertainty, we have successfully synthesized non-blocking synchronization techniques and advanced on-line algorithmic techniques, in the context of reactive concurrent data structures.On the other hand, we have developed a new optimal on-line algorithm for one-way trading with time-varying exchange-rate bounds. The algorithm extends the set of practical problems that can be transformed to the one-way trading so as to find an optimal solution. In this thesis, the new algorithm demonstrates its applicability by solving the freshness problem in the context of concurrent data structures.
16.	Ha, Phuong, 1976, et al. (författare) Reactive multi-word synchronization for multiprocessors 2003 Ingår i: Proceedings of the 12th IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques (PACT'03). - 0769520219 ; , s. 184-193 Konferensbidrag (refereegranskat)
17.	Ha, Phuong, 1976, et al. (författare) Reactive multi-word synchronization for multiprocessors 2004 Ingår i: The Journal of Instruction-Level Parallelism. ; 6:Special issue Tidskriftsartikel (refereegranskat)abstract Shared memory multiprocessor systems typically provide a set of hardware primitives in order to support synchronization. Generally, they provide "single-word" read-modify-write hardware primitives such as compare-and-swap, load-linked/store-conditional and fetch-and-op, from which the higher-level synchronization operations are then implemented in software. Although the "single-word" hardware primitives are conceptually powerful enough to support higher-level synchronization, from the programmer's point of view they are not as useful as their generalizations to the "multi-word" objects. This paper presents two fast and reactive lock-free "multi-word" compare-and-swap algorithms. The algorithms dynamically measure the level of contention as well as the memory conflicts of the "multi-word" compare-and-swap operations, and in response, they react accordingly in order to guarantee good performance in a wide range of system conditions. The algorithms are non-blocking (lock-free), allowing in this way fast dynamical behavior. Experiments on thirty processors of an SGI Origin2000 multiprocessor show that both our algorithms react quickly according to the contention variations and outperform the best known alternatives in almost all contention conditions.
18.	Ha, Phuong, 1976 (författare) Reactive Shared Objects for Interprocess Synchronization 2004 Licentiatavhandling (övrigt vetenskapligt/konstnärligt)abstract In parallel processing environments such as multiprocessor systems, processes are synchronized using concurrent objects, which allow many concurrent processes to access them at the same time. The performance of these concurrent objects heavily relies on the load conditions of the surrounding environment (e.g. OS, other applications, interconnection network, etc.), which are variant and unpredictable due to indeterministic interferences among processes.Therefore, in order to minimize synchronization cost, a major overhead in parallel applications, we need to construct efficient concurrent objects that can react to the environment variations in order to achieve good performance under all possible system conditions. This thesis presents two new reactive shared objects: the "reactive multi-word compare-and-swap object" and the "self-tuning reactive tree".The "multi-word" compare-and-swap objects are powerful constructs, which make the design of concurrent data objects much more effective and easier. The "reactive" multi-word compare-and-swap objects are advanced objects that dynamically measure the level of contention as well as the memory conflicts of the multi-word compare-and-swap operations, and in response, they react accordingly in order to guarantee good performance in a wide range of system conditions. We present two algorithms that are non-blocking (lock-free), allowing in this way a fast and dynamical behavior.The self-tuning reactive trees distribute a set of processes to smaller groups accessing different parts of the memory in a coordinated manner, and moreover reactively adjust their size in order to attain efficient performance across different levels of contention. We present a data structure that in an on-line manner balances the trade-off between the tree traversal latency and the latency due to contention at the tree leaves.The experiments on the SGI Origin2000, a well-known commercial ccNUMA multiprocessor, showed that the new reactive objects achieve significant improvements compared to previous results.
19.	Ha, Phuong, 1976, et al. (författare) Reactive Spin-locks: A Self-tuning Approach 2005 Ingår i: The 8th International Symposium on Parallel Architectures, Algorithms and Networks, IEEE press. - 0769525091 ; , s. 33-38 Konferensbidrag (refereegranskat)abstract Reactive spin-lock algorithms that can automatically adapt to contention variation on the lock have received great attention in the field of multiprocessor synchronization, since they can help applications achieve good performance inall possible contention conditions. However, in existing reactive spin-locks the reaction relies on (i) some "fixed" experimentally tuned thresholds, which may get frequently inappropriate in dynamic environments like multiprogramming/multiprocessor systems, or (ii) known probability distributions of inputs. This paper presents a new reactive spin-lock algorithm that is completely self-tuning, which means no experimentally tuned parameter nor probability distribution of inputs are needed. The new spin-lock is built on a competitive online algorithm. Our experiments, which use the Spark98 kernels and the SPLASH-2 applications as application benchmarks, on a multiprocessor machine SGI Origin2000 and on an Intel Xeon workstation show that the new self-tuning spin-lock helps applications with different characteristics achieve good performance in a wide range of contention levels.
20.	Ha, Phuong, 1976, et al. (författare) Self-Adjusting Trees 2003 Rapport (övrigt vetenskapligt/konstnärligt)
21.	Ha, Phuong, 1976, et al. (författare) Self-tuning reactive diffracting trees 2007 Ingår i: Journal of Parallel and Distributed Computing. - : Elsevier BV. - 1096-0848 .- 0743-7315. ; 67:6, s. 674-694 Tidskriftsartikel (refereegranskat)abstract Reactive diffracting trees are efficient distributed objects that support synchronization, by distributing sets of memory accesses to different memory banks in a coordinated manner. They adjust their size in order to retain their efficiency in the presence of different contention levels. Their adjustment is sensitive to parameters that have to be manually determined after experimentation. Since these parameters depend on the application as well as on the system configuration and load, determining their optimal values is difficult in practice. Moreover, the adjustments are done one level at a time, hence the cost of multi-level adjustments can be high.This paper presents a new method for reactive diffracting trees, without the need of hand-tuned parameters. The new self-tuning trees (ST-trees) can balance, in an online manner, the trade-off between the tree-traversal latency and the latency due to contention on accessing the leaf nodes (i.e. the nodes where the desirable computation takes place). Moreover, the paper presents a data structure that enables the trees to grow or shrink by several levels in one adjustment step. The behavior of the reactive diffracting trees is illustrated in the paper via experiments performed on a well-known ccNUMA multiprocessor system. The experiments study the new self-tuning trees, also in connection with the original hand-tuned reactive diffracting trees. The experiments have showed that the new self-tuning trees are efficient, and that they react in the same way (i.e. select the same tree depth for the same contention level) as the hand-tuned trees, while they are able to adjust quicker than the latter (as they are able to grow or shrink by several levels in one adjustment step).
22.	Ha, Phuong, 1976, et al. (författare) Self-tuning Reactive Distributed Trees for Counting and Balancing 2004 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. ; 3544, s. 140-157 Tidskriftsartikel (refereegranskat)abstract The main contribution of this paper is that it shows that it is possible to have reactive distributed trees for counting and balancing with no need for the user to fix manually any parameters. We present a data structure that in an on-line manner balances the trade-off between the tree traversal latency and the latency due to contention at the tree nodes. Moreover, the fact that our method can expand or shrink a subtree several levels in any adjustment step, has a positive effect in the efficiency: this feature helps the self-tuning reactive tree minimize the adjustment time, which affects not only the execution time of the process adjusting the size of the tree but also the latency of all other processes traversing the tree at the same time with no extra memory requirements. Our experimental study compared the new trees with the reactive diffracting ones on the SGI Origin2000, a well-known commercial ccNUMA multiprocessor. This study showed that the self-tuning reactive trees i) select the same tree depth as the reactive diffracting trees do; ii) perform better and iii) react faster.
23.	Ha, Phuong, 1976, et al. (författare) The Synchronization Power of Coalesced Memory Accesses 2008 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 9783540877783 ; 5218, s. 320-334 Konferensbidrag (refereegranskat)abstract Multicore processor architectures have established themselves as the new generation of processor architectures. As part of the one core to many cores evolution, memory access mechanisms have advanced rapidly. Several new memory access mechanisms have been implemented in many modern commodity multicore processors. Memory access mechanisms, by devising how processing cores access the shared memory, directly influence the synchronization capabilities of the multicore processors. Therefore, it is crucial to investigate the synchronization power of these new memory access mechanisms.This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need of synchronization primitives other than reads and writes. In the case of the contemporary CUDA processors, our results imply that the coalesced memory access mechanisms have consensus numbers up to sixteen.
24.	Ha, Phuong, 1976, et al. (författare) Wait-free Programming for General Purpose Computations on Graphics Processors 2008 Ingår i: the Proceedings of the 22th International Parallel and Distributed Symposium (IPDPS 2008). - 1530-2075. - 9781424416936 ; , s. 1-12 Konferensbidrag (refereegranskat)abstract The fact that graphics processors (GPUs) are today’s most powerful computational hardware for the dollar has motivated researchers to utilize the ubiquitous and powerful GPUs for general-purpose computing. Recent GPUs feature the single-program multiple-data (SPMD) multicore architecture instead of the single-instruction multiple-data (SIMD). However, unlike CPUs, GPUs devote their transistors mainly to data processing rather than data caching and flow control, and consequently most of the powerful GPUs with many cores do not support any synchronization mechanisms between their cores. This prevents GPUs from being deployed more widely for general-purpose computing. This paper aims at bridging the gap between the lack of synchronization mechanisms in recent GPU architectures and the need of synchronization mechanisms in parallel applications. Based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects like wait-free and t-resilient read-modify-write objects for a general model of recent GPU architectures without strong hardware synchronization primitives like test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), whether N is the number of processes. Our result demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
25.	Larsson, Andreas, 1979, et al. (författare) Multi-word Atomic Read/Write Registers on Multiprocessor Systems 2004 Ingår i: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Berlin, Heidelberg : Springer Berlin Heidelberg. - 1611-3349 .- 0302-9743. - 3540230254 ; 3221, s. 736-748 Rapport (övrigt vetenskapligt/konstnärligt)abstract Modern multiprocessor systems offer advanced synchronization primitives, built in hardware, to support the development of efficient parallel algorithms. In this paper we develop a simple and efficient algorithm for atomic registers (variables) of arbitrary length. The simplicity and better complexity of the algorithm is achieved via the utilization of two such common synchronization primitives. In this paper we also evaluate the performance of our algorithm and the performance of a practical previously know algorithm that is based only on read and write primitives. The evaluation is performed on 3 well-known, parallel architectures. This evaluation clearly shows that both algorithms are practical and that as the size of the register increases our algorithm performs better, accordingly to its complexity behavior.
26.	Larsson, Andreas, 1979, et al. (författare) Multiword atomic read/write registers on multiprocessor systems 2008 Ingår i: Journal of Experimental Algorithmics. - 1084-6654. ; 13, s. 7-30 Tidskriftsartikel (refereegranskat)abstract Modern multiprocessor systems offer advanced synchronization primitives, built in hardware, to support the development of efficient parallel algorithms. In this article, we develop a simple and efficient algorithm, the READERSFIELD algorithm, for atomic registers (variables) of arbitrary length. The simplicity and better complexity of the algorithm is achieved via the utilization of two such common synchronization primitives. In this article, we also experimentally evaluate the performance of our algorithm, together with lock-based approaches and a practical, previously previously known algorithm that is based that is based only on read and write primitives. The experimental evaluation is performed on three well-known parallel architectures. This evaluation clearly shows that both algorithms are practical and that as the size of the register increases the READERSFIELD algorithm performs better, following its analytical evaluation.

Skapa referenser, mejla, bekava och länka

Länka till träfflistan

Träfflista för sökning "WFRF:(Ha Phuong 1976) "

Avgränsa träffmängd

År