SwePub - search: L773:1573 0484 OR L773:0920 85...

No.	Reference
1.	Abdi, Somayeh, et al. (author) Cost-aware workflow offloading in edge-cloud computing using a genetic algorithm 2024 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. Journal article (peer-reviewed) Abstract: The edge-cloud computing continuum effectively uses fog and cloud servers to meet the quality of service (QoS) requirements of tasks when edge devices cannot meet those requirements. This paper focuses on the workflow offloading problem in edge-cloud computing and formulates this problem as a nonlinear mathematical programming model. The objective function is to minimize the monetary cost of executing a workflow while satisfying constraints related to data dependency among tasks and QoS requirements, including security and deadlines. Additionally, it presents a genetic algorithm for the workflow offloading problem to find near-optimal solutions with the cost minimization objective. The performance of the proposed mathematical model and genetic algorithm is evaluated on several real-world workflows. Experimental results demonstrate that the proposed genetic algorithm can find admissible solutions comparable to the mathematical model and outperforms particle swarm optimization, bee life algorithm, and a hybrid heuristic-genetic algorithm in terms of workflow execution costs.
2.	Atzori, Marco, 1992-, et al. (author) In situ visualization of large-scale turbulence simulations in Nek5000 with ParaView Catalyst 2022 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 78:3, pp. 3605-3620 Journal article (peer-reviewed) Abstract: In situ visualization on high-performance computing systems allows us to analyze simulation results that would otherwise be impossible, given the size of the simulation data sets and offline post-processing execution time. We develop an in situ adaptor for ParaView Catalyst and Nek5000, a massively parallel Fortran and C code for computational fluid dynamics. We perform a strong scalability test up to 2048 cores on KTH’s Beskow Cray XC40 supercomputer and assess in situ visualization’s impact on the Nek5000 performance. In our study case, a high-fidelity simulation of turbulent flow, we observe that in situ operations significantly limit the strong scalability of the code, reducing the relative parallel efficiency to only ≈ 21 % on 2048 cores (the relative efficiency of Nek5000 without in situ operations is ≈ 99 %). Through profiling with Arm MAP, we identified a bottleneck in the image composition step (that uses the Radix-kr algorithm) where a majority of the time is spent on MPI communication. We also identified an imbalance of in situ processing time between rank 0 and all other ranks. In our case, better scaling and load-balancing in the parallel image composition would considerably improve the performance of Nek5000 with in situ capabilities. In general, the result of this study highlights the technical challenges posed by the integration of high-performance simulation codes and data-analysis libraries and their practical use in complex cases, even when efficient algorithms already exist for a certain application scenario.
3.	Casas, Israel, et al. (author) PSO-DS : a scheduling engine for scientific workflow managers 2017 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 73:9, pp. 3924-3947 Journal article (peer-reviewed) Abstract: Cloud computing, an important source of computing power for the scientific community, requires enhanced tools for an efficient use of resources. Current solutions for workflow execution lack frameworks to deeply analyze applications and consider realistic execution times as well as computation costs. In this study, we propose cloud user-provider affiliation (CUPA) to guide workflow owners in identifying the tools required to run their applications. Additionally, we develop PSO-DS, a specialized scheduling algorithm based on particle swarm optimization. CUPA encompasses the interaction of cloud resources, workflow manager system and scheduling algorithm. Its featured scheduler PSO-DS is capable of converging on a strategic distribution of tasks among resources to efficiently optimize makespan and monetary cost. We compared PSO-DS performance against four well-known scientific workflow schedulers. In a test bed based on VMware vSphere, schedulers mapped five up-to-date benchmarks representing different scientific areas. PSO-DS proved its efficiency by reducing makespan and monetary cost of tested workflows by 75 and 78%, respectively, when compared with other algorithms. CUPA, with the featured PSO-DS, opens the path to develop a full system in which scientific cloud users can run their computationally expensive experiments.
4.	Cebrián, Juan M., et al. (author) Leakage-efficient design of value predictors through state and non-state preserving techniques 2011 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 55:1, pp. 28-50 Journal article (peer-reviewed)
5.	Cebrian, Juan M., et al. (author) Managing power constraints in a single-core scenario through power tokens 2014 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 68:1, pp. 414-442 Journal article (peer-reviewed) Abstract: Current microprocessors face constant thermal and power-related problems during their everyday use, usually solved by applying a power budget to the processor/core. Dynamic voltage and frequency scaling (DVFS) has been an effective technique that allowed microprocessors to match a predefined power budget. However, the continuous increase of leakage power due to technology scaling along with low resolution of DVFS makes it less attractive as a technique to match a predefined power budget as technology goes to deep-submicron. In this paper, we propose the use of microarchitectural techniques to accurately match a power constraint while maximizing the energy-efficiency of the processor. We will predict the processor power dissipation at cycle level (power token throttling) or at a basic block level (basic block level mechanism), using the dissipated power translated into tokens to select between different power-saving microarchitectural techniques. We also introduce a two-level approach in which DVFS acts as a coarse-grain technique to lower the average power dissipation towards the power budget, while microarchitectural techniques focus on removing the numerous power spikes. Experimental results show that the use of power-saving microarchitectural techniques in conjunction with DVFS is up to six times more precise, in terms of total energy consumed over the power budget, than only using DVFS to match a predefined power budget.
6.	Daneshtalab, Masoud, et al. (author) In-order delivery approach for 2D and 3D NoCs 2015 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 71:8, pp. 2877-2899 Journal article (peer-reviewed) Abstract: In many applications, it is critical to guarantee the in-order delivery of requests from the master cores to the slave cores, so that the requests can be executed in the correct order without requiring buffers. Since packets in NoCs may use different paths and traffic congestion varies across routes, the in-order delivery constraint cannot be met without support. To guarantee the in-order delivery, traditional approaches either use dimension-order routing or employ reordering buffers at network interfaces. Dimension-order routing degrades the performance considerably while the usage of reordering buffers imposes large area overhead. In this paper, we present a mechanism allowing packets to be routed through multiple paths in the network, helping to balance the traffic load while guaranteeing the in-order delivery. The proposed method combines the advantages of both deterministic and adaptive routing algorithms. The simple idea is to use different deterministic algorithms for independent flows. This approach neither requires reordering buffers nor limits packets to use a single path. The algorithm is simple and practical with negligible area overhead over dimension-order routing. The concept is investigated in both 2D and 3D mesh networks.
7.	Dastgeer, Usman, 1985-, et al. (author) Performance-aware Composition Framework for GPU-based Systems 2015 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 71:12, pp. 4646-4662 Journal article (peer-reviewed) Abstract: User-level components of applications can be made performance-aware by annotating them with performance models and other metadata. We present a component model and a composition framework for the automatically optimized composition of applications for modern GPU-based systems from such components, which may expose multiple implementation variants. The framework targets the composition problem in an integrated manner, with the ability to do global performance-aware composition across multiple invocations. We demonstrate several key features of our framework relating to performance-aware composition including implementation selection, both with performance characteristics being known (or learned) beforehand as well as cases when they are learned at runtime. We also demonstrate hybrid execution capabilities of our framework on real applications. Furthermore, we present a bulk composition technique that can make better composition decisions by considering information about upcoming calls along with data flow information extracted from the source program by static analysis. The bulk composition improves over the traditional greedy performance-aware policy that only considers the current call for optimization.
8.	de Blanche, Andreas, 1975-, et al. (author) Addressing characterization methods for memory contention aware co-scheduling 2015 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 71:4, pp. 1451-1483 Journal article (peer-reviewed) Abstract: The ability to precisely predict how memory contention degrades performance when co-scheduling programs is critical for reaching high performance levels in cluster, grid and cloud environments. In this paper we present an overview and compare the performance of state-of-the-art characterization methods for memory aware (co-)scheduling. We evaluate the prediction accuracy and co-scheduling performance of four methods: one slowdown-based, two cache-contention based and one based on memory bandwidth usage. Both our regression analysis and scheduling simulations find that the slowdown-based method, represented by Memgen, performs better than the other methods. The linear correlation coefficient of Memgen's prediction is 0.890. Memgen's preferred schedules reached 99.53 % of the obtainable performance on average. Also, the memory bandwidth usage method performed almost as well as the slowdown-based method. Furthermore, while most prior work promotes characterization based on cache miss rate, we found it to be on par with random scheduling of programs and highly unreliable.
9.	Elmisery, Ahmed M., et al. (author) Privacy-enhanced middleware for location-based sub-community discovery in implicit social groups 2016 In: Journal of Supercomputing. - : Springer Science+Business Media B.V.. - 0920-8542 .- 1573-0484. ; 72:1, pp. 247-274 Journal article (peer-reviewed) Abstract: In our connected world, recommender services have become widely known for their ability to provide expert and personalized information to participants of diverse applications. With the excessive growth of social networks, a new kind of service is being embraced, termed "group based recommendation services", where recommender services can be utilized to discover sub-communities within implicit social groups and provide referrals to new participants in order to join various sub-communities of other participants who share similar preferences or interests. Nevertheless, protecting participants' privacy in recommendation services is a crucial aspect, and privacy concerns might prevent participants from exchanging their own data with these services, which in turn degrades the accuracy of the generated referrals. So in order to gain accurate referrals, recommendation services should have the ability to discover previously unknown sub-communities from different social groups in a way that preserves the privacy of participants in each group. In this paper, we present a middleware that runs on end-users' mobile phones to sanitize their profiles' data when released for generating referrals, such that computation of referrals continues over the sanitized version of their profiles' data. The proposed middleware is equipped with cryptography protocols to facilitate private discovery of sub-communities from the sanitized version of participants' profiles in a university scenario. Location data are added to participants' profiles to improve the awareness of surrounding sub-communities, so the offered referrals can be filtered based on locations adjacent to the participant's location. We performed a number of different experiments to test the efficiency and accuracy of our protocols. We also developed a formal model for the tradeoff between privacy level and accuracy of referrals. As supported by the experiments, the sub-communities were correctly identified with good accuracy and an acceptable privacy level.
10.	Elmroth, Erik, 1964-, et al. (author) High Performance Computations for Large Scale Simulations of Subsurface Multiphase Fluid and Heat Flow 2001 In: Journal of Supercomputing. - 0920-8542 .- 1573-0484. ; 18:3, pp. 235-258 Journal article (peer-reviewed) Abstract: TOUGH2 is a widely used reservoir simulator for solving subsurface flow related problems such as nuclear waste geologic isolation, environmental remediation of soil and groundwater contamination, and geothermal reservoir engineering. It solves a set of coupled mass and energy balance equations using a finite volume method. This contribution presents the design and analysis of a parallel version of TOUGH2. The parallel implementation first partitions the unstructured computational domain. For each time step, a set of coupled non-linear equations is solved with Newton iteration. In each Newton step, a Jacobian matrix is calculated and an ill-conditioned non-symmetric linear system is solved using a preconditioned iterative solver. Communication is required for convergence tests and data exchange across partitioning borders. Parallel performance results on Cray T3E-900 are presented for two real application problems arising in the Yucca Mountain nuclear waste site study. The execution time is reduced from 7504 seconds on two processors to 126 seconds on 128 processors for a 2D problem involving 52,752 equations. For a larger 3D problem with 293,928 equations the time decreases from 10,055 seconds on 16 processors to 329 seconds on 512 processors.
11.	Fang, Z., et al. (author) Active memory controller 2012 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 1573-0484 .- 0920-8542. ; 62:1, pp. 510-549 Journal article (peer-reviewed) Abstract: Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
12.	Farahnakian, Fahimeh, et al. (author) Adaptive Load Balancing in Learning-based Approaches for Many-core Embedded Systems 2014 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 68:3, pp. 1214-1234 Journal article (peer-reviewed) Abstract: Adaptive routing algorithms improve network performance by distributing traffic over the whole network. However, they require congestion information to facilitate load balancing. To provide local and global congestion information, we propose a learning method based on dual reinforcement learning approach. This information can be dynamically updated according to the changing traffic condition in the network by propagating data and learning packets. We utilize a congestion detection method which updates the learning rate according to the congestion level. This method calculates the average number of free buffer slots in each switch at specific time intervals and compares it with maximum and minimum values. Based on the comparison result, the learning rate is set to a value between 0 and 1. If a switch gets congested, the learning rate is set to a high value, meaning that the global information is more important than local. In contrast, local is more emphasized than global information in non-congested switches. Results show that the proposed approach achieves a significant performance improvement over the traditional Q-routing, DRQ-routing, DBAR and Dynamic XY algorithms.
13.	Fazlali, M., et al. (author) Efficient datapath merging for the overhead reduction of run-time reconfigurable systems 2012 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 1573-0484 .- 0920-8542. ; 59:2, pp. 636-657 Journal article (peer-reviewed) Abstract: High latencies in FPGA reconfiguration are known as a major overhead in run-time reconfigurable systems. This overhead can be reduced by merging multiple data flow graphs representing different kernels of the original program into a single (merged) datapath that will be configured less often compared to the separate datapaths scenario. However, the additional hardware introduced by this technique increases the kernels' execution time. In this paper, we present a novel datapath merging technique that reduces both the configuration and execution times of kernels mapped on the reconfigurable fabric. Experimental results show up to 13% reduction in the configuration and execution times of kernels from media-bench workloads, compared to previous art on datapath merging. When compared to conventional high-level synthesis algorithms, our proposal reduces kernels' configuration and execution times by up to 48%.
14.	Gaona, E., et al. (author) Selective dynamic serialization for reducing energy consumption in hardware transactional memory systems 2014 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 1573-0484 .- 0920-8542. ; 68:2, pp. 914-934 Journal article (peer-reviewed) Abstract: In the search for new paradigms to simplify multithreaded programming, Transactional Memory (TM) is currently being advocated as a promising alternative to deadlock-prone lock-based synchronization. In this way, future many-core CMP architectures may need to provide hardware support for TM. On the other hand, power dissipation constitutes a first class consideration in multicore processor designs. In this work, we propose Selective Dynamic Serialization (SDS) as a new technique to improve energy consumption without degrading performance in applications with conflicting transactions by avoiding wasted work due to aborted transactions. Our proposal, which is implemented on top of a hardware transactional memory (HTM) system with an eager conflict management policy, detects and serializes conflicting transactions dynamically (at run-time). In its simplest form, in case of conflict, one transaction is allowed to continue whilst the rest are completely stalled. Once the executing transaction has finished, it wakes up several of the stalling transactions. More elaborate implementations of SDS try to delay this behavior until serialization of transactions is profitable, achieving the best trade-off between performance, energy savings and network traffic. SDS implementations differ from each other in the condition that triggers the serialization mode. We have evaluated several SDS schemes using GEMS, a full-system simulator implementing the LogTM-SE Eager-Eager HTM system, and several benchmarks from the STAMP suite. Results for a 16-core CMP show that SDS obtains reductions of 6 % on average in energy consumption (more than 20 % in high contention scenarios) in a wide range of benchmarks without affecting, on average, execution time. At the same time, network traffic level is also reduced by 22 %.
15.	Gong, Jing, et al. (author) Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations 2016 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 72:11, pp. 4160-4180 Journal article (peer-reviewed) Abstract: We present a hybrid GPU implementation and performance analysis of Nekbone, which represents one of the core kernels of the incompressible Navier-Stokes solver Nek5000. The implementation is based on OpenACC and CUDA Fortran for local parallelization of the compute-intensive matrix-matrix multiplication part, which significantly minimizes the modification of the existing CPU code while extending the simulation capability of the code to GPU architectures. Our discussion includes the GPU results of OpenACC interoperating with CUDA Fortran and the gather-scatter operations with GPUDirect communication. We demonstrate performance of up to 552 Tflops on 16,384 GPUs of the OLCF Cray XK7 Titan.
16.	Gong, Yueyuan, et al. (author) Discovering sub-patterns from time series using a normalized cross-match algorithm 2016 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 72:10, pp. 3850-3867 Journal article (peer-reviewed) Abstract: Time series data stream mining has attracted considerable research interest in recent years. Pattern discovery is a challenging problem in time series data stream mining. Because the data update continuously and the sampling rates may be different, dynamic time warping (DTW)-based approaches are used to solve the pattern discovery problem in time series data streams. However, the naive form of the DTW-based approach is computationally expensive. Therefore, Toyoda proposed the CrossMatch (CM) approach to discover the patterns between two time series data streams (sequences), which requires only O(n) time per data update, where n is the length of one sequence. CM, however, does not support normalization, which is required for some kinds of sequences (e.g. stock prices, ECG data). Therefore, we propose a normalized-CrossMatch approach that extends CM to enforce normalization while maintaining the same performance capabilities.
17.	Goude, Anders, et al. (author) Adaptive fast multipole methods on the GPU 2013 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 63, pp. 897-918 Journal article (peer-reviewed)
18.	Li, Lu, 1983-, et al. (author) MeterPU: a generic measurement abstraction API: Enabling energy-tuned skeleton backend selection 2018 In: Journal of Supercomputing. - : SPRINGER. - 0920-8542 .- 1573-0484. ; 74:11, pp. 5643-5658 Journal article (peer-reviewed) Abstract: We present MeterPU, an easy-to-use, generic and low-overhead abstraction API for taking measurements of various metrics (time, energy) on different hardware components (e.g., CPU, DRAM, GPU) in a heterogeneous computer system, using pluggable platform-specific measurement implementations behind a common interface in C++. We show that with MeterPU, not only legacy (time) optimization frameworks, such as autotuned skeleton back-end selection, can be easily retargeted for energy optimization, but also switching between measurement metrics or techniques for arbitrary code sections now becomes trivial. We apply MeterPU to implement the first energy-tunable skeleton programming framework, based on the SkePU skeleton programming library.
19.	Liu, Felix, et al. (author) A survey of HPC algorithms and frameworks for large-scale gradient-based nonlinear optimization 2022 In: Journal of Supercomputing. - : Springer Nature. - 0920-8542 .- 1573-0484. ; 78:16, pp. 17513-17542 Journal article (peer-reviewed) Abstract: Large-scale numerical optimization problems arise from many fields and have applications in both industrial and academic contexts. Finding solutions to such optimization problems efficiently requires algorithms that are able to leverage the increasing parallelism available in modern computing hardware. In this paper, we review previous work on parallelizing algorithms for nonlinear optimization. To introduce the topic, the paper starts by giving an accessible introduction to nonlinear optimization and high-performance computing. This is followed by a survey of previous work on parallelization and utilization of high-performance computing hardware for nonlinear optimization algorithms. Finally, we present a number of optimization software libraries and how they are able to utilize parallel computing today. This study can serve as an introduction point for researchers interested in nonlinear optimization or high-performance computing, as well as provide ideas and inspiration for future work combining these topics.
20.	Megzari, Abdelmoujib, et al. (author) Applications, challenges, and solutions to single- and multi-objective critical node detection problems : a survey 2023 In: Journal of Supercomputing. - : Springer Nature. - 0920-8542 .- 1573-0484. ; 79:17, pp. 19770-19808 Journal article (peer-reviewed) Abstract: Recognizing critical nodes in complex networks has emerged as a challenging task across several application areas. The critical node detection problem (CNDP) is an optimization challenge that entails determining the subset of nodes whose removal adversely affects network connectivity and performance based on certain predetermined criteria. The problem of recognizing critical nodes has received significant consideration since it is a vital challenge in a multitude of application areas. As a result, many variants have been proposed on the basis of numerous metrics. In this survey, we discuss different applications, challenges, and solutions to single- and multi-objective CNDP. We review and classify different recent advancements and obtained outcomes for each variant, proposed from 2017 to 2022. To our best knowledge, this is the first survey on the heuristic optimization-based solutions for CNDP that have been developed in recent years. This study also provides researchers with future insight into filling gaps in the critical nodes research field and identifying emerging research trends in this area.
21.	Min-Allah, Nasro, et al. (author) Deployment of real-time systems in the cloud environment 2021 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 77:2, pp. 2069-2090 Journal article (peer-reviewed) Abstract: Interest in real-time systems has grown considerably over recent years, primarily due to a significant increase in the use of smart technologies and latency-sensitive applications such as cloud gaming, audio/video streaming, and smart homes. Significant work has been done on resource mapping in the cloud environment, and a number of promising results have been established accordingly where the focus is mainly on resource provisioning. However, the applicability of cloud computing services for real-time systems generated from smart systems is still in its infancy and remains relatively unexplored. To address this gap, we propose a model for the smart systems that periodically offload computational workload to the cloud environment where virtual machines are allocated according to rate-monotonic scheduling policy to ensure requests are processed within the associated deadlines. Deadlines of tasks have been relaxed to improve server utilization as well as maintain a level of confidence in the timing constraints. Experimental results are discussed to highlight the applicability of static priority assignment for the workload in the context of virtual machines allocation.
22.	Mohseni, Zeynab, et al. (author) A Deadlock-free Routing Algorithm for Irregular 3D Network-on-Chips with Wireless Links 2018 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 74:2, pp. 953-969 Journal article (other academic/artistic) Abstract: In recent years, the idea of wireless three-dimensional network-on-chips (3D NoCs) was promoted in order to design many-core chips with greater performance and lower energy consumption. This technology is the combination of different dies that are stacked on each other. Therefore, it is necessary to propose a suitable routing mechanism for irregular wireless 3D NoCs that can support the agnostic topologies. In this paper, we propose a deadlock-free routing algorithm for wireless 3D NoCs, called Floyd-base Inter-chip Traffic distribution (FIT), which is based on the Floyd routing algorithm. In the FIT algorithm, the number of hops is reduced compared to the already established deterministic algorithms; moreover, the traffic distribution is improved. Evaluation results show that our proposed routing algorithm significantly improves the performance and throughput by reducing the energy consumption, the average hop count and the communication latency.
23.	Rahmani, Amir-Mohammad, et al. (author) Special section on advances in methods for adaptive multicore systems 2014 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 68:3, pp. 1023-1026 Journal article (peer-reviewed)
24.	Shimchenko, Marina, et al. (author) Analysing software prefetching opportunities in hardware transactional memory 2022 In: Journal of Supercomputing. - : Springer Nature. - 0920-8542 .- 1573-0484. ; 78:1, pp. 919-944 Journal article (peer-reviewed) Abstract: Hardware transactional memory emerged to make parallel programming more accessible. However, the performance pitfall of this technique is squashing speculatively executed instructions and re-executing them in case of aborts, ultimately resorting to serialization in case of repeated conflicts. A significant fraction of aborts occurs due to conflicts (concurrent reads and writes to the same memory location performed by different threads). Our proposal aims to reduce conflict aborts by reducing the window of time during which transactional regions can suffer conflicts. We achieve this by using software prefetching instructions inserted automatically at compile-time. Through these prefetch instructions, we intend to bring the necessary data for each transaction from the main memory to the cache before the transaction itself starts to execute, thus converting the otherwise long latency cache misses into hits during the execution of the transaction. The obtained results show that our approach decreases the number of aborts by 30% on average and improves performance by up to 19% and 10% for two out of the eight evaluated benchmarks. We provide insights into when our technique is beneficial given certain characteristics of the transactional regions, the advantages and disadvantages of our approach, and finally, discuss potential solutions to overcome some of its limitations.
25.	Sinaei, Sima, et al. (author) Multi-objective algorithms for the application mapping problem in heterogeneous multiprocessor embedded system design 2019 In: Journal of Supercomputing. - : Springer New York LLC. - 0920-8542 .- 1573-0484. ; 75:8, pp. 4150-4176 Journal article (peer-reviewed) Abstract: Design at the Electronic System-Level tackles the increasing complexity of embedded systems by raising the level of abstraction in system specification and modeling. Two important steps in this process are evaluation of a single design configuration and design space exploration. The exponential size of the design space, along with the complex task of simulating a single design point, makes it impossible to explore the design space efficiently in almost all MPSoC design situations. In order to overcome this problem, one or both of the main steps of the design process (i.e., simulation and exploration) must be accelerated. In this paper, for the first part of the design process, high-level analytical models for application mapping and evaluation are presented in order to accelerate the evaluation of a single design configuration. In the second part of the design process, two multi-objective optimization algorithms that are based on particle swarm optimization and simulated annealing have been proposed for performing design space exploration. Considering multimedia applications as case studies, each of these methods produces a set of near-optimal points. Simulation results show that the proposed methods can lead to near-optimal design configurations with acceptable accuracy in a reasonable time.
26.	Thoman, Peter, et al. (author) A taxonomy of task-based parallel programming technologies for high-performance computing 2018 In: Journal of Supercomputing. - : SPRINGER. - 0920-8542 .- 1573-0484. ; 74:4, pp. 1422-1434 Journal article (peer-reviewed) Abstract: Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.
27.	Viebke, Andre, et al. (author) CHAOS : A Parallelization Scheme for Training Convolutional Neural Networks on Intel Xeon Phi 2019 In: Journal of Supercomputing. - : Springer. - 0920-8542 .- 1573-0484. ; 75:1, pp. 197-227 Journal article (peer-reviewed) Abstract: Deep learning is an important component of big-data analytic tools and intelligent applications, such as, self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive, and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training convolutional neural networks (CNN) named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS). Major features of CHAOS include the support for thread and vector parallelism, non-instant updates of weight parameters during back-propagation without a significant delay, and implicit synchronization in arbitrary order. CHAOS is tailored for parallel computing systems that are accelerated with the Intel Xeon Phi. We evaluate our parallelization approach empirically using measurement techniques and performance modeling for various numbers of threads and CNN architectures. Experimental results for the MNIST dataset of handwritten digits using the total number of threads on the Xeon Phi show speedups of up to 103x compared to the execution on one thread of the Xeon Phi, 14x compared to the sequential execution on Intel Xeon E5, and 58x compared to the sequential execution on Intel Core i5.
28.	Wang, Hui, et al. (author) Group improved enhanced dynamic frame slotted ALOHA anti-collision algorithm 2014 In: Journal of Supercomputing. - : Springer Science and Business Media LLC. - 0920-8542 .- 1573-0484. ; 69:3, pp. 1235-1253 Journal article (peer-reviewed) Abstract: With the development of information technology and declining in the cost of tags, radio frequency identification (RFID) system has become more and more popular, which has been widely used in a lot of areas, such as logistics tracking, animals identification, medicine, electronic toll collection, inventory, asset management, manufacturing, etc. However, when we use RFID technology to identify the objects, tag collision is one of the important factors to influence the identification efficiency. Currently, Aloha-based algorithm is one of the popular anti-collision algorithms which performs well when the number of tags is small. But it is not very efficient for cases with large number of tags and some areas which tags' number can be estimated, such as warehouse, supermarket, the production lines of smart factory and so on. So in this paper, we proposed a new anti-collision algorithm called group improved enhanced dynamic frame slotted ALOHA (GroupIEDFSA) by estimating the number of unread tags first, comparing the maximum frame size and dividing tags into groups when the number of tags which are activated is large. What is more, compared with enhanced dynamic frame slotted ALOHA (EDFSA) algorithm in the process of identification, GroupIEDFSA algorithm will combine new group based on the unread tags' number. Simulation results show that the efficiency of GroupIEDFSA algorithm system improves by 20 % in time and over 50 % in rounds than EDFSA algorithm in the standard mode, and increases by 1 % in time when we used fast mode.
29.	Yazdanpanah, Fahimeh, et al. (author) An energy-efficient partition-based XYZ-planar routing algorithm for a wireless network-on-chip 2019 In: Journal of Supercomputing. - : SPRINGER. - 0920-8542 .- 1573-0484. ; 75:2, pp. 837-861 Journal article (peer-reviewed) Abstract: In the current many-core architectures, network-on-chips (NoCs) have been efficiently utilized as communication backbones for enabling massive parallelism and high degree of integration on a chip. In spite of the advantages of conventional NoCs, wired multi-hop links impose limitations on their performance by long delay and much power consumption especially in large systems. To overcome these limitations, different solutions such as using wireless interconnections have been proposed. Utilizing long-range, high bandwidth and low power wireless links can lead to solve the problems corresponding to wired links. Meanwhile, the grid-like mesh is the most stable topology in conventional NoC designs. That is why most of the wireless network-on-chip (WNoC) architectures have been designed based on this topology. The goals of this article are to challenge mesh topology and to demonstrate the efficiency of honeycomb-based WNoC architectures. In this article, we propose HoneyWiN, hybrid wired/wireless NoC architecture with honeycomb topology. Also, a partition-based XYZ-planar routing algorithm for energy conservation is proposed. In order to demonstrate the advantages of the proposed architecture, first, an analytical comparison of HoneyWiN with a mesh-based WNoC, as the baseline architecture, is carried out. In order to compare the proposed architecture, we implement our partition-based routing algorithm in the form of 2-axes coordinate system in the baseline architecture. Simulation results show that HoneyWiN reduces about 17% of energy consumption while increases the throughput by 10% compared to the mesh-based WNoC. Then, HoneyWiN is compared with four state-of-the-art mesh-based NoC architectures. In all of the evaluations, HoneyWiN provides higher performance in term of delay, throughput and energy consumption. Overall, the results indicate that HoneyWiN is very effective in improving throughput, increasing speed and reducing energy consumption.
30.	Öhberg, Tomas, et al. (author) Hybrid CPU-GPU execution support in the skeleton programming framework SkePU 2020 In: Journal of Supercomputing. - : SPRINGER. - 0920-8542 .- 1573-0484. ; 76:7, pp. 5038-5056 Journal article (peer-reviewed) Abstract: In this paper, we present a hybrid execution backend for the skeleton programming framework SkePU. The backend is capable of automatically dividing the workload and simultaneously executing the computation on a multi-core CPU and any number of accelerators, such as GPUs. We show how to efficiently partition the workload of skeletons such as Map, MapReduce, and Scan to allow hybrid execution on heterogeneous computer systems. We also show a unified way of predicting how the workload should be partitioned based on performance modeling. With experiments on typical skeleton instances, we show the speedup for all skeletons when using the new hybrid backend. We also evaluate the performance on some real-world applications. Finally, we show that the new implementation gives higher and more reliable performance compared to an old hybrid execution implementation based on dynamic scheduling.
31.	Wang, Pei, et al. (author) Short-term effects of nutrient compensation following whole-tree harvesting on soil and soil water chemistry in a young Norway spruce stand 2010 In: Plant and Soil. - : Springer Science and Business Media LLC. - 0032-079X .- 1573-5036. ; 336, pp. 323-336 Journal article (peer-reviewed) Abstract: A growing demand for bioenergy from conventional forestry in Sweden will increase the need of nutrient compensation, that preferably should be made relatively shortly after harvesting and have no undesired side-effects. This study compared the effects of granulated wood ash (Ash), N-free, dolomite-based fertiliser (Vitality) and the green fraction of harvest residues (Residues) on the podsolic soil and soil solution of a young Norway spruce (Picea abies (L.) Karst) stand in SW Sweden. The treatments were applied three years after clear-felling and whole-tree harvesting. The soil solution was repeatedly sampled in the rooting zone 2-5 years after treatment. The soil study was performed 4 years after the Ash treatment and 3 years after Residues treatment and the last Vitality treatment (the Vitality treatment was applied on two occasions over 2 years). The Vitality treatment increased base saturation and effective CEC in the humus layer in relation to the other treatments, and also increased Ca and K concentrations in the soil solution. The Ash treatment resulted in higher exchangeable K concentration than Vitality in the litter layer, and Residues increased K concentrations in the soil water. No treatment influenced the KCl-exchangeable nitrate concentrations in the soil or the nitrate levels in the soil water. The results indicate that granulated wood ash could be used for long-term nutrient compensation without undesired short-term side-effects.
32.	Yuksekdag, Yusuf (author) Health Without Care? Vulnerability, Medical Brain Drain, and Health Worker Responsibilities in Underserved Contexts 2018 In: Health Care Analysis. - : Springer. - 1065-3058 .- 1573-3394. ; 26:1, pp. 17-32 Journal article (peer-reviewed) Abstract: There is a consensus that the effects of medical brain drain, especially in the Sub-Saharan African (SSA) countries, ought to be perceived as more than a simple misfortune. Temporary restrictions on the emigration of health workers from the region is one of the already existing policy measures to tackle the issue - while such a restrictive measure brings about the need for quite a justificatory work. A recent normative contribution to the debate by Gillian Brock provides a fruitful starting point. In the first step of her defence of emigration restrictions, Brock provides three reasons why skilled workers themselves would hold responsibilities to assist with respect to vital needs of their compatriots. These are fair reciprocity, duty to support vital institutions, and attending to the unintended harmful consequences of one's actions. While the first two are explained and also largely discussed in the literature, the third requires an explication on how and on which basis skilled workers would have a responsibility as such. In this article, I offer a vulnerability approach with its dependency aspect that may account for why the health workers in underserved contexts would have a responsibility to attend to the unintended side effects of their actions that may lead to a vital risk of harm for the population. I discuss HIV/AIDS care in Zimbabwe as a case in point in order to show that local health workers may have responsibilities to assist the population who are vulnerable to their mobility.
