SwePub
Search the SwePub database

  Advanced search

Hit list for the search "L773:0190 3918 OR L773:9781509028238"

Search: L773:0190 3918 OR L773:9781509028238

  • Results 1-10 of 11
Sort/group the result list

Numbering | Reference | Cover image | Find
1.
  • Chatterjee, Bapi, 1982, et al. (author)
  • Help-Optimal and Language-Portable Lock-Free Concurrent Data Structures
  • 2016
  • In: 45th International Conference on Parallel Processing (ICPP), 2016. - 0190-3918. - 9781509028238 ; September 2016, pp. 360-369
  • Conference paper (peer-reviewed), abstract:
    • Helping is a widely used technique to guarantee lock-freedom in many concurrent data structures. An optimized helping strategy improves the overall performance of a lock-free algorithm. In this paper, we propose help-optimality, which essentially implies that no operation step is accounted for exclusive helping in the lock-free synchronization of concurrent operations. To describe the concept, we revisit the designs of a lock-free linked-list and a lock-free binary search tree and present improved algorithms. Our algorithms employ atomic single-word compare-and-swap (CAS) primitives and are linearizable. We design the algorithms without using any language/platform-specific mechanism. Specifically, we use neither bit-stealing from a pointer nor runtime type introspection of objects. Thus, our algorithms are language-portable. Further, to optimize the amortized number of steps per operation, if a CAS execution to modify a shared pointer fails, we obtain a fresh set of thread-local variables without restarting an operation from scratch. We use several micro-benchmarks in both C/C++ and Java to validate the efficiency of our algorithms against the existing state-of-the-art. The experiments show that the algorithms are scalable. Our implementations perform on a par with highly optimized ones and in many cases yield 10%-50% higher throughput.
  •  
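The retry-without-restart idea in the abstract above can be illustrated with a minimal, generic C++ sketch. This is not the paper's linked-list or binary-search-tree algorithm; it only shows the common CAS-retry pattern in which a failed compare-and-swap refreshes the thread-local snapshot (compare_exchange_weak reloads `expected`) and the operation retries instead of starting over.

    #include <atomic>

    // Hypothetical node type, used only for this illustration.
    struct Node {
        int   value;
        Node* next;
    };

    // Generic lock-free push onto a shared head pointer. On CAS failure,
    // compare_exchange_weak reloads the current head into `expected`, so the
    // retry works with fresh thread-local state rather than restarting the
    // whole operation from scratch.
    void push(std::atomic<Node*>& head, Node* node) {
        Node* expected = head.load(std::memory_order_relaxed);
        do {
            node->next = expected;                 // link against the latest snapshot
        } while (!head.compare_exchange_weak(expected, node,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }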
2.
  • Ekelin, Cecilia, 1973, et al. (author)
  • A Lower-Bound Algorithm for Minimizing Network Communication in Real-Time Systems
  • 2002
  • In: International Conference on Parallel Processing, 2002. Proceedings. - 0190-3918. - 0769516777
  • Conference paper (peer-reviewed), abstract:
    • In this paper, we propose a pseudo-polynomial-time lower-bound algorithm for the problem of assigning and scheduling real-time tasks in a distributed system such that the network communication is minimized. The key feature of our algorithm is translating the task assignment problem into the so-called k-cut problem of a graph, which is known to be solvable in polynomial time for fixed k. Experiments show that the lower bound computed by our algorithm is in fact optimal in up to 89% of the cases and increases the speed of an overall optimization algorithm by a factor of two on average.
  •  
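For reference, the k-cut problem the abstract maps task assignment onto has the following textbook formulation (the paper's specific reduction is not reproduced here): partition the vertices of an edge-weighted graph G = (V, E, w) into k non-empty, disjoint parts so that the total weight of edges crossing between parts is minimized,

    \min_{\substack{V_1 \cup \dots \cup V_k = V \\ V_i \cap V_j = \emptyset,\; V_i \neq \emptyset}}
    \;\sum_{\substack{\{u,v\} \in E \\ u \in V_i,\, v \in V_j,\, i \neq j}} w(u,v)

For fixed k this problem is solvable in polynomial time, which is what the lower-bound algorithm exploits.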
3.
  • Hollmann, Jochen, 1970, et al. (author)
  • Enhancing Garbage Collection Synchronization using Explicit Bit Barriers
  • 2015
  • In: 44th International Conference on Parallel Processing, ICPP 2015, Beijing, China, 1-4 September. - 0190-3918. - 9781467375870 ; 2015-December, pp. 769-778
  • Conference paper (peer-reviewed), abstract:
    • Multicore architectures offer a convenient way to unlock concurrency between application (called mutator) and garbage collector, yet efficient synchronization between the two by means of barriers is critical to unlock this concurrency. Hardware Transactional Memory (HTM), now commercially available, opens up new ways for synchronization with dramatically lower overhead for the mutator. Unfortunately, HTM-based schemes proposed to date either require specialized hardware support or impose severe overhead through invocation of OS-level trap handlers. This paper proposes Explicit Bit Barriers (EBB), a novel approach for fast synchronization between the mutator and HTM-encapsulated relocation tasks. We compare the efficiency of EBBs with read barriers based on virtual memory that rely on OS-level trap handlers. We show that EBBs are nearly as efficient as those needing specialized hardware, but run on commodity Intel processors with TSX extensions.
  •  
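The kind of HTM-encapsulated work referred to above can be sketched on commodity Intel TSX hardware using the RTM intrinsics. This is a generic lock-elision skeleton under that assumption, not the paper's Explicit Bit Barrier mechanism.

    #include <immintrin.h>   // Intel RTM intrinsics; compile with -mrtm
    #include <atomic>

    // Tiny spinlock used as the non-transactional fallback path.
    std::atomic<bool> fallback_locked{false};

    // Run `work` inside a hardware transaction, falling back to the spinlock
    // if the transaction aborts (conflict, capacity, unsupported instruction).
    template <typename Work>
    void run_transactionally(Work work) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            // Subscribe to the fallback lock so a concurrent fallback aborts us.
            if (fallback_locked.load(std::memory_order_relaxed))
                _xabort(0xff);
            work();          // executes speculatively
            _xend();         // commit
            return;
        }
        while (fallback_locked.exchange(true, std::memory_order_acquire))
            ;                // spin until the fallback lock is free
        work();
        fallback_locked.store(false, std::memory_order_release);
    }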
4.
  • Knyaginin, Dmitry, 1983, et al. (author)
  • Crystal: A design-time resource partitioning method for hybrid main memory
  • 2014
  • In: ICPP 2014: International Conference on Parallel Processing, Minneapolis, United States, 9-12 September 2014. - 0190-3918. ; 2014-November, pp. 90-100
  • Conference paper (peer-reviewed), abstract:
    • Non-Volatile Memory (NVM) technologies can be used to reduce system-level execution time, energy, or cost but they add a new design dimension. Finding the best amounts of DRAM and NVM in hybrid main memory systems is a nontrivial design-time issue, the best solution to which depends on many factors. Such resource partitioning between DRAM and NVM can be framed as an optimization problem where the minimum of a target metric is sought, trends matter more than absolute values, and thus the precision of detailed modeling is overkill. Here we present Crystal, an analytic approach to early and rapid design-time resource partitioning of hybrid main memories. Crystal provides first-order estimates of system-level execution time and energy, sufficient to enable exhaustive search of the best amount and type of NVM for given workloads and partitioning goals. Crystal thus helps system designers to quickly find the most promising hybrid configurations for detailed evaluation. E.g., Crystal shows how for specific workloads higher system-level performance and energy efficiency can be achieved by employing an NVM with the speed and energy consumption of NAND Flash instead of a much faster and more energy efficient NVM like phase-change memory.
  •  
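The abstract does not spell out Crystal's model, but purely to illustrate what a first-order, design-time estimate for a hybrid DRAM/NVM main memory can look like, one might write execution time and energy as access counts weighted by per-technology latency and energy (illustrative form only, not the paper's equations):

    T \approx N_{\mathrm{DRAM}}\, t_{\mathrm{DRAM}} + N_{\mathrm{NVM}}\, t_{\mathrm{NVM}},
    \qquad
    E \approx N_{\mathrm{DRAM}}\, e_{\mathrm{DRAM}} + N_{\mathrm{NVM}}\, e_{\mathrm{NVM}}

Here the capacity split between DRAM and NVM determines how the accesses divide between the two technologies; sweeping that split over candidate NVMs is the kind of exhaustive search the abstract describes.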
5.
  • Larsen, P, et al. (author)
  • Parallelizing more loops with compiler guided refactoring
  • 2012
  • In: Proceedings of the International Conference on Parallel Processing. 41st International Conference on Parallel Processing, ICPP 2012, Pittsburgh, PA, 10-13 September 2012. - 0190-3918. - 9780769547961 ; pp. 410-419
  • Conference paper (peer-reviewed), abstract:
    • The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process: many different obstacles affect or prevent it in practice. To address this predicament we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application source code. This helps leverage the compiler's ability to generate loop-parallel code. We employ our system to modify two sequential benchmarks dealing with image processing and edge detection, resulting in scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark to manually-parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%). It also performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms manually-parallelized and optimized OpenMP code (within 109-242%).
  •  
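As a reminder of the loop-level parallelism the paper targets, here is a minimal, generic OpenMP loop in C++ (an illustration only, not code from the paper's benchmarks): once the iterations are known to be independent, a single directive lets the compiler distribute them across cores.

    #include <vector>

    // Scale every pixel independently; the iterations carry no dependences,
    // so the loop can be parallelized with one OpenMP directive.
    // Compile with -fopenmp.
    void scale(std::vector<float>& pixels, float gain) {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(pixels.size()); ++i) {
            pixels[i] *= gain;
        }
    }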
6.
  • Manivannan, Madhavan, 1986, et al. (author)
  • Efficient Forwarding of Producer-Consumer Data in Task-based Programs
  • 2013
  • In: Proceedings of the International Conference on Parallel Processing. 42nd International Conference on Parallel Processing, ICPP 2013, Lyon, 1-4 October 2013. - 0190-3918. - 9780769551173 ; pp. 517-522
  • Conference paper (peer-reviewed), abstract:
    • Task-based programming models are increasingly being adopted due to their ability to express parallelism. They also lead to higher programmer productivity by delegating to the run-time system and the architecture demanding parallelism-management tasks such as scheduling and staging of the communication between tasks. This paper focuses on techniques to optimize producer-consumer sharing in task-based programs. As the set of producer and consumer tasks can often be statically determined, coherence prediction techniques are expected to successfully optimize producer-consumer sharing. We show that they are ineffective because the mapping of tasks to cores changes based on runtime conditions. The paper contributes a technique that forwards produced and spatially close blocks to the consumer in a single transaction when a consumer requests a first block. We also find that stride prefetching is competitive with our forwarding technique for sufficiently coarse tasks. However, its effectiveness deteriorates as the task granularity is reduced because of limited opportunities to train for the access pattern and to issue prefetches sufficiently ahead of time. This makes our forwarding scheme a robust alternative to reduce communication overhead in task-based programs.
  •  
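The producer-consumer sharing discussed above can be made concrete with a minimal task-based example using OpenMP task dependences (a generic C++ illustration; the forwarding itself is a hardware/coherence mechanism and is not shown). The blocks written by the producer task are exactly the kind of data a forwarding scheme would push toward the consumer's core.

    #include <cstdio>

    int main() {
        int block[256];
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: block)        // producer task
            for (int i = 0; i < 256; ++i) block[i] = i;

            #pragma omp task depend(in: block)         // consumer task
            {
                long sum = 0;
                for (int i = 0; i < 256; ++i) sum += block[i];
                std::printf("sum = %ld\n", sum);
            }
        }
        return 0;
    }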
7.
  • Manivannan, Madhavan, 1986, et al. (author)
  • Implications of Merging Phases on Scalability of Multi-core Architectures
  • 2011
  • In: Proceedings of the International Conference on Parallel Processing. 40th International Conference on Parallel Processing, ICPP 2011, Taipei City, 13-16 September 2011. - 0190-3918. - 9780769545103 ; pp. 622-631
  • Conference paper (peer-reviewed), abstract:
    • Amdahl's Law dictates that in parallel applications serial sections establish an upper limit on scalability. Asymmetric chip multiprocessors with a large core in addition to several small cores have recently been advocated as a promising design paradigm because the large core can accelerate the execution of serial sections and hence mitigate the scalability bottlenecks caused by large serial sections. This paper studies the scalability of a set of data mining workloads that have negligible serial sections. The formulation of Amdahl's Law, which optimistically assumes constant serial sections, estimates these workloads to scale to hundreds of cores in a chip multiprocessor (CMP). However, the overhead of carrying out merging (or reduction) operations makes scalability peak at a smaller number of cores. We establish this by extending Amdahl's speedup model to factor in the impact of reduction operations on the speedup of applications on symmetric as well as asymmetric CMP designs. Our analytical model estimates that asymmetric CMPs with one large and many tiny cores are only optimal for applications with a low reduction overhead. However, as the overhead starts to increase, the balance shifts towards using fewer but more capable cores. This eventually limits the performance advantage of asymmetric over symmetric CMPs.
  •  
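The modelling step in the abstract starts from the classic Amdahl formulation; the sketch below restates it and adds a generic merging term to show why speedup can peak even when the serial fraction f is negligible (illustrative form only; the paper's exact model may differ).

    % Classic Amdahl's Law, serial fraction f on n cores:
    S(n) = \frac{1}{\,f + \dfrac{1-f}{n}\,}

    % Adding a reduction/merging overhead r(n) that grows with the core count:
    S(n) = \frac{1}{\,f + \dfrac{1-f}{n} + r(n)\,}

Because r(n) grows with n while (1-f)/n shrinks, the speedup reaches a maximum at a finite core count instead of approaching 1/f.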
8.
  • Negi, Anurag, 1980, et al. (author)
  • Eager meets lazy: The impact of write-buffering on hardware transactional memory
  • 2011
  • In: Proceedings of the International Conference on Parallel Processing. 40th International Conference on Parallel Processing, ICPP 2011, Taipei City, 13-16 September 2011. - 0190-3918. - 9780769545103 ; pp. 73-82
  • Conference paper (peer-reviewed), abstract:
    • Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated, and the performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights in this paper, related to the interplay between buffering mechanisms, system policies and workload characteristics, clearly distinguish the performance gains to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information will facilitate sound design decisions when incorporating HTMs into parallel architectures.
  •  
9.
  • Pardo Jimenez, Raul, 1988, et al. (author)
  • GPU powered ROSA analyzer
  • 2013
  • In: 42nd Annual International Conference on Parallel Processing, ICPP 2013, Lyon, France, 1-4 October 2013. - 0190-3918. - 9780769551173 ; pp. 901-908
  • Conference paper (peer-reviewed), abstract:
    • In this work we present the first version of ROSAA, the ROSA Analyzer, running on a GPU architecture. ROSA is a Markovian process algebra [11] able to capture pure non-determinism, probabilities and timed actions. On top of it, a tool has been developed to get closer to a fully automatic analysis of the behaviour of a system specified as a ROSA process: ROSAA [10] automatically generates the part of the Labeled Transition System (LTS in the sequel) in which we are interested, and occasionally the whole LTS. Since this is a very computationally expensive task, a GPU-powered version of ROSAA with parallel processing capabilities has been created to better cope with the generation process. As conventional GPU workloads mainly focus on data parallelism over quite similar types of data, this work represents a rather novel use of this kind of architecture; moreover, the authors are not aware of any other formal-model tool running on GPUs. ROSAA starts with a syntactic analysis that generates a layered structure suitable for applying the transition rules of the operational semantics in the easiest way. Since more than one rule may apply to a given specification/state, this is the key point at which the GPU provides its benefits: all the states reachable in a single semantics step from a given one can be generated at the same time by launching a set of threads on the GPU platform. Although this is a step forward for the practical usefulness of this type of tool, the state-explosion problem remains, so reducing the size of the LTS will sooner or later be required; along this line, the authors are working on a heuristic to prune a sufficient number of branches of the LTS, making the task of generating it more tractable.
  •  
10.
  • Pathan, Risat, 1979, et al. (author)
  • Parameterized Schedulability Analysis on Uniform Multiprocessors
  • 2010
  • In: Proceedings of the International Conference on Parallel Processing (ICPP), San Diego, CA, September 13-16, 2010. - 0190-3918. - 9780769541563 ; pp. 323-332
  • Conference paper (peer-reviewed), abstract:
    • This paper addresses global Rate-Monotonic (RM) scheduling of implicit-deadline periodic real-time tasks on uniform multiprocessor platforms. In particular, we propose new schedulability conditions that include a set of easily computable task-set parameters for achieving better system utilization while meeting the deadlines of all the tasks. First, an individual sufficient schedulability condition is derived for each task. Then, the collection of schedulability conditions for the tasks is condensed into two different simple sufficient schedulability conditions for the entire task system: one for uniform multiprocessors and one for unit-capacity multiprocessors. Finally, we show that our proposed simple rate-monotonic schedulability conditions for uniform and unit-capacity multiprocessors have higher worst-case system utilization than all other state-of-the-art simple schedulability conditions for global rate-monotonic scheduling of implicit-deadline tasks.
  •  
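As background for the utilization-based conditions discussed above, the classical Liu and Layland bound for rate-monotonic scheduling of n implicit-deadline periodic tasks on a single processor is (the paper's uniform-multiprocessor conditions themselves are not reproduced here):

    \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right)

Here C_i and T_i are the worst-case execution time and period of task i; the paper derives analogous sufficient conditions, parameterized by task-set characteristics, for uniform and unit-capacity multiprocessors.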