SwePub

Result list for search "WFRF:(Walulya Ivan 1985)"

  • Results 1-10 of 18
1.
  • Bäckström, Karl, 1994, et al. (author)
  • Consistent lock-free parallel stochastic gradient descent for fast and stable convergence
  • 2021
  • In: Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021 ; pp. 423-432
  • Conference paper (peer-reviewed). Abstract:
    • Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms, e.g. HOGWILD!, have received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Despite that they induce staleness and inconsistency, they have shown speedup for problems satisfying smooth, strongly convex targets, and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is however a gap in current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We contribute with answering questions in this gap by studying a spectrum of parallel algorithmic implementations of AsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization in SGD convergence, and in sensitivity to hyper-parameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management, allocating space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention.
      We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for two prominent deep learning (DL) applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN). We observe the crucial impact of contention, staleness and consistency and show how, thanks to the aforementioned properties, Leashed-SGD provides significant improvements in stability as well as wall-clock time to convergence (from 20-80% up to 4× improvements) compared to the standard lock-based AsyncSGD algorithm and HOGWILD!, while reducing the overall memory footprint.
2.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Concurrent Linearizable Nearest Neighbour Search in LockFree-kD-tree
  • 2015
  • Report (other academic/artistic). Abstract:
    • The Nearest neighbour search (NNS) is an important problem in a large number of application domains dealing with multidimensional data. In concurrent settings, where dynamic modifications are allowed, a linearizable implementation of NNS is highly desirable to discover the latest nearest neighbour of a given target data-point. In this paper, we introduce the LockFree-kD-tree (LFkD-tree): a lock-free concurrent kD-tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-tree use single-word read and compare-and-swap (CAS) atomic primitives, which are readily supported on commonly available multi-core processors. We experimentally evaluate the LFkD-tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves significant speed-up compared to the implementations of an existing sequential kD-tree and a recently proposed multidimensional indexing structure, PH-tree.
3.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Concurrent linearizable nearest neighbour search in lockfree-kd-Tree
  • 2018
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : ACM. ; Part F133180
  • Conference paper (peer-reviewed). Abstract:
    • The Nearest neighbour search (NNS) is a fundamental problem in many application domains dealing with multidimensional data. In a concurrent setting, where dynamic modifications are allowed, a linearizable implementation of NNS is highly desirable. This paper introduces the LockFree-kD-Tree (LFkD-Tree): a lock-free concurrent kD-Tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-Tree use single-word read and compare-and-swap (CAS) atomic primitives, which are readily supported on available multi-core processors. We experimentally evaluate the LFkD-Tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves significant speed-up compared to the implementations of an existing sequential kD-Tree and a recently proposed multidimensional indexing structure, PH-Tree. © 2018 Copyright held by the owner/author(s).
4.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Concurrent linearizable nearest neighbour search in LockFree-kD-tree
  • 2021
  • In: Theoretical Computer Science. - : Elsevier BV. - 0304-3975. ; 886, pp. 27-48
  • Journal article (peer-reviewed). Abstract:
    • The Nearest neighbour search (NNS) is a fundamental problem in many application domains dealing with multidimensional data. In a concurrent setting, where dynamic modifications are allowed, a linearizable implementation of the NNS is highly desirable. This paper introduces the LockFree-kD-tree (LFkD-tree): a lock-free concurrent kD-tree, which implements an abstract data type (ADT) that provides the operations Add, Remove, Contains, and NNS. Our implementation is linearizable. The operations in the LFkD-tree use single-word read and compare-and-swap (CAS) atomic primitives, which are readily supported on available multi-core processors. We experimentally evaluate the LFkD-tree using several benchmarks comprising real-world and synthetic datasets. The experiments show that the presented design is scalable and achieves significant speed-up compared to the implementations of an existing sequential kD-tree and a recently proposed multidimensional indexing structure, PH-tree.
5.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Help-optimal and Language-portable Lock-free Concurrent Data Structures
  • 2016
  • Report (other academic/artistic). Abstract:
    • Helping is the most common mechanism to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. Additionally, we do not use a language/platform specific mechanism to modulate helping, specifically, we use neither bit-stealing from a pointer nor runtime type introspection of objects, making the algorithms language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution to modify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimized ones and in many cases yield 10%-50% higher throughput.
6.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures
  • 2016
  • In: 45th International Conference on Parallel Processing (ICPP), 2016. - 0190-3918. - 9781509028238 ; September 2016, pp. 360-369
  • Conference paper (peer-reviewed). Abstract:
    • Helping is a widely used technique to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. We design the algorithms without using any language/platform-specific mechanism. Specifically, we use neither bit-stealing from a pointer nor runtime type introspection of objects. Thus, our algorithms are language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution to modify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimized ones and in many cases yield 10%-50% higher throughput.
7.
  • Gulisano, Vincenzo Massimiliano, 1984, et al. (author)
  • Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate Objects
  • 2015
  • In: DEBS 2015 - Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems. - New York, NY, USA : ACM. - 9781450332866
  • Conference paper (peer-reviewed). Abstract:
    • In this work we present the design, implementation and evaluation of our approach to solve the DEBS 2015 Grand Challenge. Our work studies how ScaleGate, a concurrent implementation of a recently proposed abstract data type, that articulates data access in parallel data streaming, can be leveraged to partition the Grand Challenge analysis among an arbitrary number of processing units. ScaleGate aims not only at supporting high throughput and low latency parallel streaming analysis, but also at guaranteeing deterministic processing, which is one of the biggest challenges in parallelizing computation while maintaining consistency. Our main contribution is a new perspective for addressing the high throughput, low latency and determinism challenges of parallel data streaming by letting such challenges permeate the entire analysis framework, down to its underlying shared data objects. As a result, we propose shared data objects that balance independent actions among processing threads in order to guarantee high throughput, while providing the necessary synchronization for deterministic processing.
8.
  • Papadopoulos, L., et al. (author)
  • A Systematic Methodology for Optimization of Applications Utilizing Concurrent Data Structures
  • 2016
  • In: IEEE Transactions on Computers. - 0018-9340. ; 65:7, pp. 2019-2031
  • Journal article (peer-reviewed). Abstract:
    • Modern multicore embedded systems often execute applications that rely heavily on concurrent data structures. The selection of efficient concurrent data structure implementations for a specific application is usually a complex and time consuming task, because each design decision often affects the performance and the energy consumption of the embedded system in various and occasionally unpredictable ways. The complexity is normally addressed by developers by adopting ad-hoc design solutions, which are often suboptimal and yield poor results. To face this problem, we propose a semi-automated methodology for the optimization of applications that utilize concurrent data structures that is based on design space exploration. The proposed approach is evaluated by using both microbenchmarks and real-world applications that are executed on multicore embedded systems with different architectural specifications. Our results show that we can identify various trade-offs between different data structure implementations that can be used to optimize applications that rely on concurrent data structures.
9.
  • Papadopoulos, L., et al. (author)
  • Customization methodology for implementation of streaming aggregation in embedded systems
  • 2016
  • In: Journal of Systems Architecture. - : Elsevier BV. - 1383-7621. ; 66-67, pp. 48-60
  • Journal article (peer-reviewed). Abstract:
    • Streaming aggregation is a fundamental operation in the area of stream processing and its implementation provides various challenges. Data flow management is traditionally performed by high performance computing systems. However, nowadays there is a trend of implementing streaming operators in low power embedded devices, due to the fact that they often provide increased performance per watt in comparison with traditional high performance systems. In this work, we present a methodology for the customization of streaming aggregation implemented in modern low power embedded devices. The methodology is based on design space exploration and provides a set of customized implementations that can be used by developers to perform trade-offs between throughput, latency, memory and energy consumption. We compare the proposed embedded system implementations of the streaming aggregation operator with the corresponding HPC and GPGPU implementations in terms of performance per watt. Our results show that the implementations based on low power embedded systems provide up to 54 and 14 times higher performance per watt than the corresponding Intel Xeon and Radeon HD 6450 implementations, respectively. (C) 2016 Elsevier B.V. All rights reserved.
10.
  • Papadopoulos, L., et al. (author)
  • Evaluation of message passing synchronization algorithms in embedded systems
  • 2014
  • In: 14th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2014. - 9781479937707 ; pp. 282-289
  • Conference paper (peer-reviewed). Abstract:
    • The constantly increasing computational power of the embedded systems is based on the integration of a large number of cores on a single chip. In such complex platforms, the synchronization of the accesses of the shared memory data is becoming a major issue, since it affects the performance of the whole system. This problem, which is currently a challenge in the embedded systems, has been studied in the High Performance Computing domain, where several message passing algorithms have been designed to efficiently avoid the limitations coming from locking. In this work, inspired from the work on message passing synchronization algorithms in the High Performance Computing domain we design and evaluate a set of synchronization algorithms for multi-core embedded platforms. We compare them with the corresponding lock-based implementations and prove that message passing synchronization algorithms can be efficiently utilized in multi-core embedded systems. By using message passing synchronization instead of lock-based, we managed to reduce the execution time of our benchmark up to 29.6%.