SwePub

Hit list for search "WFRF:(Tordsson Johan 1980 )"


  • Results 1-50 of 68
1.
  • Ali-Eldin, Ahmed, et al. (authors)
  • An adaptive hybrid elasticity controller for cloud infrastructures
  • 2012
  • In: 2012 IEEE Network Operations and Management Symposium (NOMS). - : IEEE Communications Society. - 9781467302685 ; , pp. 204-212
  • Conference paper (peer-reviewed), abstract:
    • Cloud elasticity is the ability of the cloud infrastructure to rapidly change the amount of resources allocated to a service in order to meet the actual varying demands on the service while enforcing SLAs. In this paper, we focus on horizontal elasticity, the ability of the infrastructure to add or remove virtual machines allocated to a service deployed in the cloud. We model a cloud service using queuing theory. Using that model we build two adaptive proactive controllers that estimate the future load on a service. We explore the different possible scenarios for deploying a proactive elasticity controller coupled with a reactive elasticity controller in the cloud. Using simulation with workload traces from the FIFA world-cup web servers, we show that a hybrid controller that incorporates a reactive controller for scale up coupled with our proactive controllers for scale down decisions reduces SLA violations by a factor of 2 to 10 compared to a regression based controller or a completely reactive controller.
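A minimal sketch of the hybrid policy summarized in this abstract: scale up reactively on load spikes, but scale down only when a proactive prediction agrees. The averaging predictor, capacities, and function names are illustrative assumptions, not the controllers from the paper.

```python
# Hybrid elasticity decision sketch: reactive scale-up, proactive scale-down.
# All numbers and the naive predictor are illustrative assumptions.

def predict_next_load(history, window=3):
    """Toy proactive estimator: mean of the most recent observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def ceil_div(a, b):
    return -(-a // b)

def decide_vm_count(current_vms, capacity_per_vm, current_load, history):
    """Return the VM count for the next control interval."""
    needed_now = ceil_div(current_load, capacity_per_vm)
    if needed_now > current_vms:
        return needed_now  # reactive scale-up: follow the spike immediately
    needed_later = ceil_div(int(predict_next_load(history)), capacity_per_vm)
    # Proactive scale-down: never release VMs the prediction says are still
    # needed, and never scale up based on the prediction alone.
    return min(current_vms, max(needed_now, needed_later, 1))

# Demand now needs 2 VMs, but predicted load still needs 3: release only one.
print(decide_vm_count(4, 50, 90, [100, 120, 90]))   # prints 3
# A spike is followed immediately regardless of the prediction.
print(decide_vm_count(2, 50, 260, [100, 120, 90]))  # prints 6
```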
2.
  • Ali-Eldin, Ahmed, et al. (authors)
  • Efficient provisioning of bursty scientific workloads on the cloud using adaptive elasticity control
  • 2012
  • In: Proceedings of the 3rd workshop on Scientific Cloud Computing Date. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450313407 - 145031340X ; , pp. 31-40
  • Conference paper (peer-reviewed), abstract:
    • Elasticity is the ability of a cloud infrastructure to dynamically change the amount of resources allocated to a running service as load changes. We build an autonomous elasticity controller that changes the number of virtual machines allocated to a service based on both monitored load changes and predictions of future load. The cloud infrastructure is modeled as a G/G/N queue. This model is used to construct a hybrid reactive-adaptive controller that quickly reacts to sudden load changes, prevents premature release of resources, takes into account the heterogeneity of the workload, and avoids oscillations. Using simulations with Web and cluster workload traces, we show that our proposed controller lowers the number of delayed requests by a factor of 70 for the Web traces and 3 for the cluster traces when compared to a reactive controller. Our controller also decreases the average number of queued requests by a factor of 3 for both traces, and reduces oscillations by a factor of 7 for the Web traces and 3 for the cluster traces. This comes at the expense of between 20% and 30% over-provisioning, as compared to a few percent for the reactive controller.
3.
  • Ali-Eldin, Ahmed, 1985-, et al. (authors)
  • Workload Classification for Efficient Auto-Scaling of Cloud Resources
  • 2013
  • Other publication (other academic/artistic), abstract:
    • Elasticity algorithms for cloud infrastructures dynamically change the amount of resources allocated to a running service according to the current and predicted future load. Since there is no perfect predictor, and since different applications' workloads have different characteristics, no single elasticity algorithm is suitable for future predictions for all workloads. In this work, we introduce WAC, a Workload Analysis and Classification tool that analyzes workloads and assigns them to the most suitable elasticity controllers based on the workloads' characteristics and a set of business level objectives. WAC has two main components, the analyzer and the classifier. The analyzer analyzes workloads to extract some of the features used by the classifier, namely, workloads' autocorrelations and sample entropies, which measure the periodicity and the burstiness of the workloads, respectively. These two features are used, together with the business level objectives, by the classifier to assign workloads to elasticity controllers. We start by analyzing 14 real workloads available from different applications. In addition, a set of 55 workloads is generated to test WAC on more workload configurations. We implement four state-of-the-art elasticity algorithms. The controllers are the classes to which the classifier assigns workloads. We use a K-nearest-neighbors classifier and experiment with different workload combinations as training and test sets. Our experiments show that, when the classifier is tuned carefully, WAC correctly classifies between 92% and 98.3% of the workloads to the most suitable elasticity controller.
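As a rough illustration of the classification step described in this abstract, the sketch below extracts two features from a trace and assigns it to a controller with a nearest-neighbour rule. Lag-1 autocorrelation and the coefficient of variation stand in for the paper's autocorrelation and sample-entropy features, and the training points and controller names are invented.

```python
# Toy workload analysis + classification in the spirit of WAC.
# Features: lag-1 autocorrelation (periodicity proxy) and coefficient of
# variation (burstiness proxy; a stand-in for sample entropy).
from statistics import mean, pstdev

def features(trace):
    m, s = mean(trace), pstdev(trace)
    if s == 0 or m == 0:
        return (0.0, 0.0)
    cov = sum((a - m) * (b - m) for a, b in zip(trace, trace[1:]))
    autocorr = cov / (len(trace[1:]) * s * s)
    return (autocorr, s / m)

def assign_controller(trace, training):
    """training: list of ((autocorr, burstiness), controller_name) pairs."""
    f = features(trace)
    dist = lambda pt: sum((x - y) ** 2 for x, y in zip(pt, f))
    return min(training, key=lambda item: dist(item[0]))[1]

training = [((0.9, 0.1), "predictive"), ((0.0, 1.5), "reactive")]
trending = [10, 12, 14, 16, 14, 12, 10, 12]  # smooth, periodic-looking
bursty = [1, 1, 1, 50, 1, 1, 1, 50]          # spiky, high variation
print(assign_controller(trending, training))  # prints predictive
print(assign_controller(bursty, training))    # prints reactive
```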
4.
  • Arkian, Hamidreza, et al. (authors)
  • An Experiment-Driven Performance Model of Stream Processing Operators in Fog Computing Environments
  • 2020
  • In: SAC '20: Proceedings of the 35th Annual ACM Symposium on Applied Computing. - New York, NY, USA : ACM Digital Library. ; , pp. 1763-1771
  • Conference paper (peer-reviewed), abstract:
    • Data stream processing (DSP) is an interesting computation paradigm in geo-distributed infrastructures such as Fog computing because it allows one to decentralize the processing operations and move them close to the sources of data. However, any decomposition of DSP operators onto a geo-distributed environment with large and heterogeneous network latencies among its nodes can have significant impact on DSP performance. In this paper, we present a mathematical performance model for geo-distributed stream processing applications derived and validated by extensive experimental measurements. Using this model, we systematically investigate how different topological changes affect the performance of DSP applications running in a geo-distributed environment. In our experiments, the performance predictions derived from this model are correct within ±2% even in complex scenarios with heterogeneous network delays between every pair of nodes.
5.
  • Arkian, Hamidreza, et al. (authors)
  • Model-based Stream Processing Auto-scaling in Geo-Distributed Environments
  • 2021
  • In: 2021 International Conference on Computer Communications and Networks (ICCCN). - : IEEE.
  • Conference paper (peer-reviewed), abstract:
    • Data stream processing is an attractive paradigm for analyzing IoT data at the edge of the Internet before transmitting processed results to a cloud. However, the relative scarcity of fog computing resources combined with the workloads' nonstationary properties make it impossible to allocate a static set of resources for each application. We propose Gesscale, a resource auto-scaler which guarantees that a stream processing application maintains a sufficient Maximum Sustainable Throughput to process its incoming data with no undue delay, while not using more resources than strictly necessary. Gesscale derives its decisions about when to rescale and which geo-distributed resource(s) to add or remove on a performance model that gives precise predictions about the future maximum sustainable throughput after reconfiguration. We show that this auto-scaler uses 17% less resources, generates 52% fewer reconfigurations, and processes more input data than baseline auto-scalers based on threshold triggers or a simpler performance model.
6.
  • Armstrong, Django, et al. (authors)
  • Contextualization : dynamic configuration of virtual machines
  • 2015
  • In: Journal of Cloud Computing. - : Springer. - 2192-113X. ; 4:17
  • Journal article (peer-reviewed), abstract:
    • New VM instances are created from static templates that contain the basic configuration of the VM to achieve elasticity with regard to capacity. Instance-specific settings can be injected into the VM during the deployment phase by means of contextualization. So far this is limited to a single data source, and data remains static throughout the lifecycle of the VM. We present a layered approach to contextualization that supports different classes of contextualization data available from several sources. The settings are made available to the VM through virtual devices. Inside each VM, data from different classes are layered on top of each other to create a unified file hierarchy. Context data can be modified during runtime by updating the contents of the virtual devices, making our approach the first contextualization approach to natively support recontextualization. Recontextualization enables runtime reconfiguration of an executing service and can act as a trigger and key enabler of self-* techniques. This trigger provides a service with a mechanism to adapt or optimize itself in response to a changing environment. The runtime reconfiguration using recontextualization and its potential gains are illustrated in an example with a distributed file system, demonstrating the feasibility of our approach.
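The layered merge described in this abstract can be pictured with plain dictionaries: each class of contextualization data contributes a layer, more specific layers override earlier ones, and recontextualization is simply a re-merge after a layer changes. The layer names and keys below are invented for illustration.

```python
# Layered contextualization sketch: settings from several classes of data are
# merged into one unified view, later (more specific) layers overriding
# earlier ones. Re-merging after a layer changes models recontextualization.

def merge_layers(*layers):
    merged = {}
    for layer in layers:  # base first, most specific last
        merged.update(layer)
    return merged

platform = {"dns": "10.0.0.1", "role": "generic"}
service = {"role": "dfs-node", "cluster": "fs-1"}
instance = {"hostname": "node-7"}

context = merge_layers(platform, service, instance)
print(context["role"])  # prints dfs-node: the service layer overrides the base

service["cluster"] = "fs-2"  # a runtime change to one layer...
print(merge_layers(platform, service, instance)["cluster"])  # prints fs-2
```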
7.
  • Badia, Rosa M., et al. (authors)
  • Demonstration of the OPTIMIS Toolkit for Cloud Service Provisioning
  • 2011
  • In: Towards a Service-Based Internet. - Berlin, Heidelberg : Springer Berlin/Heidelberg. - 9783642247545 - 9783642247552 ; , pp. 331-333
  • Conference paper (peer-reviewed), abstract:
    • We demonstrate the OPTIMIS toolkit for scalable and dependable service platforms and architectures that enable flexible and dynamic provisioning of Cloud services. The innovations demonstrated are aimed at optimizing Cloud services and infrastructures based on aspects such as trust, risk, eco-efficiency, cost, performance, and legal constraints. Adaptive self-preservation is part of the toolkit to meet predicted and unforeseen changes in resource requirements. By taking into account the whole service life cycle, the multitude of future Cloud architectures, and by taking a holistic approach to sustainable service provisioning, the toolkit provides a foundation for a reliable, sustainable, and trustworthy Cloud computing industry.
10.
  • Berglund, Ann-Charlotte, et al. (authors)
  • Combining local and grid resources in scientific workflows (for Bioinformatics)
  • 2009
  • Conference paper (peer-reviewed), abstract:
    • We examine some issues that arise when using both local and Grid resources in scientific workflows. Our previous work addresses and illustrates the benefits of a light-weight and generic workflow engine that manages and optimizes Grid resource usage. Extending on this effort, we here illustrate how a client tool for bioinformatics applications employs the engine to interface with Grid resources. We also explore how to define data flows that transparently integrate local and Grid subworkflows. In addition, the benefits of parameter sweep workflows are examined and a means for describing this type of workflow in an abstract and concise manner is introduced. Finally, the above mechanisms are employed to perform an orthology detection analysis.
11.
  • Breitgand, D., et al. (authors)
  • Policy-Driven Service Placement Optimization in Federated Clouds
  • 2011
  • Other publication (other academic/artistic), abstract:
    • Efficient provisioning of elastic services constitutes a significant management challenge for cloud computing providers. We consider a federated cloud paradigm, where one cloud can subcontract workloads to partnering clouds to meet peaks in demand without costly over-provisioning. We propose a model for service placement in federated clouds to maximize profit while protecting Quality of Service (QoS) as specified in the Service Level Agreements (SLA) of the workloads. Our contributions include an Integer Linear Program (ILP) formulation of the generalized federated placement problem and application of this problem to load balancing and consolidation within a cloud, as well as for cost minimization for remote placement in partnering clouds. We also provide a 2-approximation algorithm based on a greedy rounding of a Linear Program (LP) relaxation of the problem. We implement our proposed approach in the context of the RESERVOIR architecture.
12.
  • Desmeurs, David, et al. (authors)
  • Event-Driven Application Brownout : Reconciling High Utilization and Low Tail Response Times
  • 2015
  • In: 2015 International Conference on Cloud and Autonomic Computing (ICCAC). - New York : IEEE Computer Society. - 9781467395663 ; , pp. 1-12
  • Conference paper (peer-reviewed), abstract:
    • Data centers currently waste a lot of energy, due to lack of energy proportionality and low resource utilization, the latter currently being necessary to ensure application responsiveness. To address the second concern we propose a novel application-level technique that we call event-driven Brownout. For each request, i.e., in an event-driven manner, the application can execute some optional code that is not required for correct operation but desirable for user experience, and does so only if the number of pending client requests is below a given threshold. We propose several autonomic algorithms, based on control theory and machine learning, to automatically tune this threshold based on measured application 95th-percentile response times. We evaluate our approach using the RUBiS benchmark, which shows an 11-fold improvement in maintaining response times close to a set-point at high utilization compared to competing approaches. Our contribution opens the path to more energy-efficient data centers, by allowing applications to keep response times close to a set-point even at high resource utilization.
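The per-request gating and threshold tuning described in this abstract can be sketched as follows. The simple step-based adaptation rule is an illustrative stand-in for the paper's control-theoretic and machine-learning tuners, and all names and numbers are invented.

```python
# Event-driven brownout sketch: optional content is served only while the
# number of pending requests is below a threshold; the threshold is nudged
# toward a 95th-percentile response-time set-point.

def handle_request(pending_requests, threshold):
    response = ["mandatory-content"]
    if pending_requests < threshold:
        response.append("optional-content")  # e.g. recommender output
    return response

def adapt_threshold(threshold, p95_ms, setpoint_ms, step=1):
    if p95_ms > setpoint_ms:
        return max(0, threshold - step)  # overloaded: shed optional work sooner
    return threshold + step              # headroom: serve richer responses

print(handle_request(pending_requests=3, threshold=5))  # optional code runs
print(handle_request(pending_requests=9, threshold=5))  # optional code skipped
print(adapt_threshold(threshold=5, p95_ms=120, setpoint_ms=100))  # prints 4
```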
13.
  • Elmroth, Erik, 1964-, et al. (authors)
  • A Grid Resource Broker Supporting Advance Reservations and Benchmark-based Resource Selection
  • 2006
  • In: Applied Parallel Computing. - Berlin, Heidelberg : Springer Verlag. - 3540290672 - 9783540290674 ; , pp. 1061-1070
  • Conference paper (peer-reviewed), abstract:
    • This contribution presents algorithms, methods, and software for a Grid resource manager, responsible for resource brokering and scheduling in early production Grids. The broker selects computing resources based on actual job requirements and a number of criteria identifying the available resources, with the aim to minimize the total time to delivery for the individual application. The total time to delivery includes the time for program execution, batch queue waiting, input/output data transfer, and executable staging. Main features of the resource manager include advance reservations, resource selection based on computer benchmark results and network performance predictions, and a basic adaptation facility.
14.
  • Elmroth, Erik, 1964-, et al. (authors)
  • A light-weight Grid workflow execution service enabling client and middleware independence
  • 2008
  • In: Parallel Processing and Applied Mathematics. - Berlin, Heidelberg : Springer-Verlag. ; , pp. 754-761
  • Conference paper (peer-reviewed), abstract:
    • We present a generic and light-weight Grid workflow execution engine made available as a Grid service. A long-term goal is to facilitate the rapid development of application-oriented end-user workflow tools, while providing a high degree of Grid middleware-independence. The workflow engine is designed for workflow execution, independent of client tools for workflow definition. A flexible plugin-structure for middleware-integration provides a strict separation of the workflow execution and the processing of individual tasks, such as computational jobs or file transfers. The light-weight design is achieved by focusing on the generic workflow execution components and by leveraging state-of-the art Grid technology, e.g., for state management. The current prototype is implemented using the Globus Toolkit 4 (GT4) Java WS Core and has support for executing workflows produced by Karajan. It also includes plugins for task execution with GT4 as well as a high-level Grid job management framework.
15.
  • Elmroth, Erik, 1964-, et al. (authors)
  • A standards-based Grid resource brokering service supporting advance reservations, coallocation and cross-Grid interoperability
  • 2009
  • In: Concurrency and Computation. - : Wiley. - 1532-0626 .- 1532-0634. ; 21:18, pp. 2298-2335
  • Journal article (peer-reviewed), abstract:
    • The problem of Grid-middleware interoperability is addressed by the design and analysis of a feature-rich, standards-based framework for all-to-all cross-middleware job submission. The architecture is designed with focus on generality and flexibility and builds on extensive use, internally and externally, of (proposed) Web and Grid services standards such as WSRF, JSDL, GLUE, and WS-Agreement. The external use provides the foundation for easy integration into specific middlewares, which is performed by the design of a small set of plugins for each middleware. Currently, plugins are provided for integration into Globus Toolkit 4 and NorduGrid/ARC. The internal use of standard formats facilitates customization of the job submission service by replacement of custom components for performing specific well-defined tasks. Most importantly, this enables the easy replacement of resource selection algorithms by algorithms that address the specific needs of a particular Grid environment and job submission scenario. By default, the service implements a decentralized brokering policy, striving to optimize the performance for the individual user by minimizing the response time for each job submitted. The algorithms in our implementation perform resource selection based on performance predictions, and provide support for advance reservations as well as coallocation of multiple resources for coordinated use. The performance of the system is analyzed with focus on overall service throughput (up to over 250 jobs per minute) and individual job submission response time (down to under one second).
16.
  • Elmroth, Erik, 1964-, et al. (authors)
  • An Advanced Grid Computing Course for Application and Infrastructure Developers
  • 2005
  • In: 2005 IEEE International Symposium on Cluster Computing and the Grid. - USA : IEEE Computer Society Press. ; , pp. 43-50
  • Conference paper (peer-reviewed), abstract:
    • This contribution presents our experiences from developing an advanced course in grid computing, aimed at application and infrastructure developers. The course was intended for computer science students with extensive programming experience and previous knowledge of distributed systems, parallel computing, computer networking, and security. The presentation includes brief presentations of all topics covered in the course, a list of the literature used, and descriptions of the mandatory computer assignments performed using Globus Toolkit 2 and 3. A summary of our experiences from the course and some suggestions for future directions concludes the presentation.
17.
  • Elmroth, Erik, 1964-, et al. (authors)
  • An Interoperable, Standards-Based Grid Resource Broker and Job Submission Service
  • 2005
  • In: First International Conference on e-Science and Grid Computing. - : IEEE Computer Society Press. - 0769524486 ; , pp. 212-220
  • Conference paper (peer-reviewed), abstract:
    • We present the architecture and implementation of a grid resource broker and job submission service, designed to be as independent as possible of the grid middleware used on the resources. The overall architecture comprises seven general components and a few conversion and integration points where all middleware-specific issues are handled. The implementation is based on state-of-the-art grid and Web services technology as well as existing and emerging standards (WSRF, JSDL, GLUE, WS-Agreement). Features provided by the service include advance reservations and a resource selection process based on a priori estimations of the total time to delivery for the application, including a benchmark-based prediction of the execution time. The general service implementation is based on the Globus Toolkit 4. For test and evaluation, plugins and format converters are provided for use with the NorduGrid ARC middleware.
18.
  • Elmroth, Erik, 1964-, et al. (authors)
  • Designing general, composable, and middleware-independent Grid infrastructure tools for multi-tiered job management
  • 2007
  • In: Towards Next Generation Grids. - : Springer-Verlag. - 9780387724973 ; , pp. 175-184
  • Conference paper (peer-reviewed), abstract:
    • We propose a multi-tiered architecture for middleware-independent Grid job management. The architecture consists of a number of services for well-defined tasks in the job management process, offering complete user-level isolation of service capabilities, multiple layers of abstraction, control, and fault tolerance. The middleware abstraction layer comprises components for targeted job submission, job control and resource discovery. The brokered job submission layer offers a Grid view on resources, including functionality for resource brokering and submission of jobs to selected resources. The reliable job submission layer includes components for fault tolerant execution of individual jobs and groups of independent jobs, respectively. The architecture is proposed as a composable set of tools rather than a monolithic solution, allowing users to select the individual components of interest. The prototype presented is implemented using the Globus Toolkit 4, integrated with the Globus Toolkit 4 and NorduGrid/ARC middlewares and based on existing and emerging Grid standards. A performance evaluation reveals that the overhead for resource discovery, brokering, middleware-specific format conversions, job monitoring, fault tolerance, and management of individual and groups of jobs is sufficiently small to motivate the use of the framework.
19.
  • Elmroth, Erik, 1964-, et al. (authors)
  • Designing service-based resource management tools for a healthy grid ecosystem
  • 2008
  • In: Parallel Processing and Applied Mathematics. - Berlin, Heidelberg : Springer-Verlag. ; , pp. 259-270
  • Conference paper (peer-reviewed), abstract:
    • We present an approach for development of Grid resource management tools, where we put into practice internationally established high-level views of future Grid architectures. The approach addresses fundamental Grid challenges and strives towards a future vision of the Grid where capabilities are made available as independent and dynamically assembled utilities, enabling run-time changes in the structure, behavior, and location of software. The presentation is made in terms of design heuristics, design patterns, and quality attributes, and is centered around the key concepts of co-existence, composability, adoptability, adaptability, changeability, and interoperability. The practical realization of the approach is illustrated by five case studies (recently developed Grid tools) highlighting the most distinct aspects of these key concepts for each tool. The approach contributes to a healthy Grid ecosystem that promotes a natural selection of "surviving" components through competition, innovation, evolution, and diversity. In conclusion, this environment facilitates the use and composition of components on a per-component basis.
20.
  • Elmroth, Erik, et al. (authors)
  • Grid resource brokering algorithms enabling advance reservations and resource selection based on performance predictions
  • 2008
  • In: Future Generation Computer Systems. - Amsterdam : Elsevier. - 0167-739X .- 1872-7115. ; 24:6, pp. 585-593
  • Journal article (peer-reviewed), abstract:
    • We present algorithms, methods, and software for a Grid resource manager that performs resource brokering and job scheduling in production Grids. This decentralized broker selects computational resources based on actual job requirements, job characteristics, and information provided by the resources, with the aim to minimize the total time to delivery for the individual application. The total time to delivery includes the time for program execution, batch queue waiting, and transfer of executable and input/output data to and from the resource. The main features of the resource broker include two alternative approaches to advance reservations, resource selection algorithms based on computer benchmark results and network performance predictions, and a basic adaptation facility. The broker is implemented as a built-in component of a job submission client for the NorduGrid/ARC middleware.
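The selection criterion described in this abstract, minimizing the predicted total time to delivery, can be sketched as below. The resource fields and the linear benchmark scaling are simplifying assumptions, not the broker's actual prediction models.

```python
# Broker sketch: choose the resource with the smallest predicted total time
# to delivery = data/executable transfer + batch-queue wait + execution time,
# where execution time is scaled by a benchmark score (higher = faster).

def predicted_ttd(resource, job):
    exec_s = job["work_units"] / resource["benchmark_score"]
    return resource["transfer_s"] + resource["queue_wait_s"] + exec_s

def select_resource(resources, job):
    return min(resources, key=lambda r: predicted_ttd(r, job))

resources = [
    {"name": "fast-but-busy", "benchmark_score": 2.0, "queue_wait_s": 600, "transfer_s": 30},
    {"name": "slow-but-idle", "benchmark_score": 1.0, "queue_wait_s": 0, "transfer_s": 30},
]
job = {"work_units": 400.0}
# The idle resource wins despite slower hardware: 430 s vs 830 s predicted.
print(select_resource(resources, job)["name"])  # prints slow-but-idle
```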
22.
  • Elmroth, Erik, 1964-, et al. (authors)
  • Resource Management for Early Production Grids
  • 2003
  • Report (popular science, debate, etc.), abstract:
    • This contribution presents the ongoing development of a resource manager for use in early production grids. Even though our main focus is to develop a stable brokering facility for current production grids, we also address features needed in further improved resource managers for future enhanced grid infrastructures. The primary target environment is the NorduGrid platform, comprising around 20 parallel systems in 5 countries, available for production grid jobs 24 hours a day. Application characteristics considered include serial, parallel, and coordinated multi-resource jobs running in sequence or in parallel, all types in either interactive or non-interactive mode. The brokering process aims to minimize the time to delivery for each individual job and is based on a number of new features including reservation capability, information about currently used or reserved capacity, benchmark-scaled time predictions, and queue adaptation capability. We present the basic motivations for all these features and discuss various issues regarding their implementations in the current grid environment.
23.
  • Elmroth, Erik, 1964-, et al. (authors)
  • Self-management challenges for multi-cloud architectures
  • 2011
  • In: Towards a Service-Based Internet. - Berlin, Heidelberg : Springer Berlin/Heidelberg. - 9783642247545 - 9783642247552 ; , pp. 38-49
  • Conference paper (peer-reviewed), abstract:
    • Addressing the management challenges for a multitude of distributed cloud architectures, we focus on the three complementary cloud management problems of predictive elasticity, admission control, and placement (or scheduling) of virtual machines. As these problems are intrinsically intertwined we also propose an approach to optimize the overall system behavior by policy-tuning for the tools handling each of them. Moreover, in order to facilitate the execution of some of the management decisions, we also propose new algorithms for live migration of virtual machines with very high workload and/or over low-bandwidth networks, using techniques such as caching, compression, and prioritization of memory pages.
24.
  • Elmroth, Erik, 1964-, et al. (authors)
  • Three fundamental dimensions of scientific workflow interoperability : model of computation, language, and execution environment
  • 2010
  • In: Future Generation Computer Systems. - : Elsevier. - 0167-739X .- 1872-7115. ; 26:2, pp. 245-256
  • Journal article (peer-reviewed), abstract:
    • We investigate interoperability aspects of scientific workflow systems and argue that the workflow execution environment, the model of computation (MoC), and the workflow language form three dimensions that must be considered depending on the type of interoperability sought: at the activity, sub-workflow, or workflow levels. With a focus on the problems that affect interoperability, we illustrate how these issues are tackled by current scientific workflows as well as how similar problems have been addressed in related areas. Our long-term objective is to achieve (logical) interoperability between workflow systems operating under different MoCs, using distinct language features, and sharing activities running on different execution environments.
25.
  • Espling, Daniel, 1983-, et al. (authors)
  • Modeling and Placement of Cloud Services with Internal Structure
  • 2016
  • In: IEEE Transactions on Cloud Computing. - : IEEE Computer Society. - 2168-7161. ; 4:4, pp. 429-439
  • Journal article (peer-reviewed), abstract:
    • Virtual machine placement is the process of mapping virtual machines to available physical hosts within a datacenter or on a remote datacenter in a cloud federation. Normally, service owners cannot influence the placement of service components beyond choosing datacenter provider and deployment zone at that provider. For some services, however, this lack of influence is a hindrance to cloud adoption: for example, services that require specific geographical deployment (e.g., due to legislation), or that require redundancy by avoiding co-located placement of critical components. We present an approach for service owners to influence placement of their service components by explicitly specifying service structure, component relationships, and placement constraints between components. We show how the structure and constraints can be expressed and subsequently formulated as constraints that can be used in placement of virtual machines in the cloud. We use an integer linear programming scheduling approach to illustrate the approach, show the corresponding mathematical formulation of the model, and evaluate it using a large set of simulated input. Our experimental evaluation confirms the feasibility of the model and shows how varying amounts of placement constraints and data center background load affect the possibility for a solver to find a solution satisfying all constraints within a certain time-frame. Our experiments indicate that the number of constraints affects the ability to find a solution to a higher degree than background load, and that for a high number of hosts with low capacity, component affinity is the dominating factor affecting the possibility to find a solution.
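The component-level placement constraints discussed in this abstract can be illustrated with a simple feasibility check over a candidate placement. The constraint format and names below are invented; the paper instead encodes such constraints in an integer linear program, which this sketch does not model.

```python
# Feasibility check for structure-aware placement constraints: affinity
# (components must share a host) and anti-affinity (components must not).

def satisfies(placement, constraints):
    """placement: component -> host; constraints: (kind, comp_a, comp_b)."""
    for kind, a, b in constraints:
        colocated = placement[a] == placement[b]
        if kind == "affinity" and not colocated:
            return False
        if kind == "anti-affinity" and colocated:
            return False
    return True

constraints = [
    ("anti-affinity", "db-primary", "db-replica"),  # redundancy requirement
    ("affinity", "web", "cache"),                   # latency requirement
]
good = {"web": "h1", "cache": "h1", "db-primary": "h2", "db-replica": "h3"}
bad = {"web": "h1", "cache": "h1", "db-primary": "h2", "db-replica": "h2"}
print(satisfies(good, constraints))  # prints True
print(satisfies(bad, constraints))   # prints False
```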
26.
  • Kihl, Maria, et al. (authors)
  • The Challenge of Cloud Control
  • 2013
  • In: The 8th International Workshop on Feedback Computing (Feedback Computing '13).
  • Conference paper (peer-reviewed), abstract:
    • Today's cloud data center infrastructures are not even near being able to cope with the enormous and rapidly varying capacity demands that will be reality in a near future. So far, very little is understood about how to transform today's data centers (being large, power-hungry facilities, and operated through heroic efforts by numerous administrators) into a self-managed, dynamic, and dependable infrastructure, constantly delivering expected QoS with reasonable operation costs and acceptable carbon footprint for large-scale services with sometimes dramatic variations in capacity demands. In this paper, we discuss some of the major challenges for resource-optimized cloud data centers. We propose a new research area called Cloud Control, which is a control theoretic approach to a range of cloud management problems, aiming to transform today's static and energy-consuming cloud data centers into self-managed, dynamic, and dependable infrastructures, constantly delivering expected quality of service with acceptable operation costs and carbon footprint for large-scale services with varying capacity demands.
27.
  • Kolberg, Simon, et al. (authors)
  • Spreading the Heat: Multi-cloud Controller for Failover and Cross-site Offloading
  • 2020
  • In: Web, Artificial Intelligence and Network Applications. - Cham : Springer Nature. - 9783030440381 - 9783030440374 ; , pp. 1154-1164
  • Conference paper (peer-reviewed), abstract:
    • Despite the ubiquitous adoption of cloud computing and a very rich set of services offered by cloud providers, current systems lack efficient and flexible mechanisms for collaboration among multiple cloud sites. In order to guarantee resource availability during peaks in demand and to fulfill service level objectives, cloud service providers cap resource allocations and, as a consequence, face severe underutilization during non-peak periods. In addition, application owners are forced to make independent contracts to deploy their applications at different sites. To illustrate how these shortcomings can be overcome, we present a lightweight cross-site offloader for OpenStack. Our controller utilizes templates and site weights to enable offloading of virtual machines between geographically dispersed sites. We present and implement a proposed architecture and demonstrate its feasibility in both a typical cross-site offloading scenario and a failover scenario.
  •  
28.
  • Kostentinos Tesfatsion, Selome, et al. (författare)
  • Virtualization techniques compared : performance, resource, and power usage overheads in clouds
  • 2018
  • Ingår i: ICPE 2018 - Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering. - New York, NY, USA : ACM Digital Library. - 9781450350952 ; , s. 145-156
  • Konferensbidrag (refereegranskat)abstract
    • Virtualization solutions based on hypervisors or containers are enabling technologies for scalable, flexible, and cost-effective resource sharing. As the fundamental limitations of each technology are yet to be understood, they need to be regularly reevaluated to better understand the trade-offs provided by the latest technological advances. This paper presents an in-depth quantitative analysis of virtualization overheads in these two groups of systems and their gaps relative to native environments, based on a diverse set of workloads that stress CPU, memory, storage, and networking resources. KVM and Xen are used to represent hypervisor-based virtualization, and LXC and Docker container-based platforms. The systems were evaluated with respect to several cloud resource management dimensions including performance, isolation, resource usage, energy efficiency, start-up time, and density. Our study is useful both to practitioners, to understand the current state of the technology and make the right decisions in the selection, operation, and/or design of platforms, and to scholars, as an illustration of how these technologies have evolved over time.
  •  
29.
  • Li, Wubin, 1983-, et al. (författare)
  • A General Approach to Service Deployment in Cloud Environments
  • 2012
  • Ingår i: Cloud and Green Computing (CGC 2012). - : IEEE Computer Society. - 9780769548647 - 9781467330275 ; , s. 17-24
  • Konferensbidrag (refereegranskat)abstract
    • The cloud computing landscape has recently developed into a spectrum of cloud architectures, leading to a broad range of management tools for similar operations but specialized for certain deployment scenarios. This both hinders the efficient reuse of algorithmic innovations within cloud management operations and increases the heterogeneity between different management systems. Our overarching goal is to overcome these problems by developing tools general enough to support the full range of popular architectures. In this contribution, we analyze commonalities in recently proposed cloud models (private clouds, multi-clouds, bursted clouds, federated clouds, etc.), and demonstrate how a key management functionality - service deployment - can be uniformly performed in all of these by a carefully designed system. The design of our service deployment framework is validated through a demonstration of how it can be used to deploy services, perform bursting and brokering, as well as mediate a cloud federation in the context of the OPTIMIS Toolkit.
  •  
30.
  • Li, Wubin, 1983-, et al. (författare)
  • An aspect-oriented approach to consistency-preserving caching and compression of web service response messages
  • 2010
  • Ingår i: Web Services (ICWS 2010). - : IEEE Computer Society. - 9781424481460 - 9780769541280 ; , s. 526-533
  • Konferensbidrag (refereegranskat)abstract
    • Web Services communicate through XML-encoded messages and suffer from substantial overhead due to verbose encoding of transferred messages and extensive (de)serialization at the end-points. We demonstrate that response caching is an effective approach to reduce Internet latency and server load. Our Tantivy middleware layer reduces the volume of data transmitted without semantic interpretation of service requests or responses, and thus improves the service response time. Tantivy achieves this reduction through the combined use of caching of recent responses and data compression techniques to decrease the data representation size. These benefits do not compromise the strict consistency semantics. Tantivy also decreases the overhead of message parsing via storage of application-level data objects rather than XML representations. Furthermore, we demonstrate how the use of aspect-oriented programming techniques provides modularity and transparency in the implementation. Experimental evaluations based on the WSTest benchmark suite demonstrate that our Tantivy system gives significant performance improvements compared to non-caching techniques.
  •  
31.
  • Li, Wubin, 1983-, et al. (författare)
  • Cost-Optimal cloud service placement under dynamic pricing schemes
  • 2013
  • Ingår i: 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing. - : IEEE Computer Society. - 9780769551524 ; , s. 187-194
  • Konferensbidrag (refereegranskat)abstract
    • Until now, most research on cloud service placement has focused on static pricing scenarios, where cloud providers offer fixed prices for their resources. However, with the recent trend of dynamic pricing of cloud resources, where the price of a compute resource can vary depending on the free capacity and load of the provider, new placement algorithms are needed. In this paper, we investigate service placement in dynamic pricing scenarios by evaluating a set of placement algorithms tuned for dynamic pricing. The algorithms range from simple heuristics to combinatorial optimization solutions. The studied algorithms are evaluated by deploying a set of services across multiple providers. Finally, we analyse the strengths and weaknesses of the algorithms considered. The evaluation suggests that an exhaustive-search-based approach is good at finding optimal solutions for service placement under dynamic pricing schemes, but its execution times are usually long. In contrast, greedy approaches perform surprisingly well, with fast execution times and acceptable solutions, and can thus be a suitable compromise considering the trade-off between quality of solution and execution time.
  •  
32.
  • Li, Wubin, 1983-, et al. (författare)
  • Modeling for Dynamic Cloud Scheduling via Migration of Virtual Machines
  • 2011
  • Ingår i: Cloud Computing Technology and Science (CloudCom). - : IEEE Computer Society. - 9781467300902 ; , s. 163-171
  • Konferensbidrag (refereegranskat)abstract
    • Cloud brokerage mechanisms are fundamental to reduce the complexity of using multiple cloud infrastructures to achieve optimal placement of virtual machines and to avoid potential vendor lock-in problems. However, current approaches are restricted to static scenarios, where changes in characteristics such as pricing schemes, virtual machine types, and service performance throughout the service life-cycle are ignored. In this paper, we investigate dynamic cloud scheduling use cases where these parameters change continuously, and propose a linear integer programming model for dynamic cloud scheduling. Our model can be applied in various scenarios through selection of the corresponding objectives and constraints, and offers the flexibility to express different levels of migration overhead when restructuring an existing infrastructure. Finally, our approach is evaluated using commercial cloud parameters in selected simulations for the studied scenarios. Experimental results demonstrate that, with proper parametrization, our approach is feasible.
  •  
33.
  • Li, Wubin, 1983-, et al. (författare)
  • Virtual machine placement for predictable and time-constrained peak loads
  • 2012
  • Ingår i: Economics of Grids, Clouds, Systems, and Services. - Berlin, Heidelberg : Springer Berlin/Heidelberg. - 3642286747 - 9783642286742 - 9783642286759 ; , s. 120-134
  • Konferensbidrag (refereegranskat)abstract
    • We present an approach to optimal virtual machine placement within datacenters for predictable and time-constrained load peaks. A method for optimal load balancing is developed, based on binary integer programming. For trade-offs between quality of solution and computation time, we also introduce methods to pre-process the optimization problem before solving it. Upper-bound-based optimizations are used to reduce the time required to compute a final solution, enabling larger problems to be solved. For further scalability, we also present three approximation algorithms, based on heuristics and/or greedy formulations. The proposed algorithms are evaluated through simulations based on synthetic data sets. The evaluation suggests that our algorithms are feasible, and that they can be combined to achieve desired trade-offs between quality of solution and execution time.
  •  
34.
  • Lorido-Botran, Tania, et al. (författare)
  • An unsupervised approach to online noisy-neighbor detection in cloud data centers
  • 2017
  • Ingår i: Expert systems with applications. - : Elsevier. - 0957-4174 .- 1873-6793. ; 89, s. 188-204
  • Tidskriftsartikel (refereegranskat)abstract
    • Resource sharing is an inherent characteristic of cloud data centers. Virtual Machines (VMs) and/or containers that are co-located on the same physical server often compete for resources, leading to interference. The noisy neighbor effect refers to an anomaly caused by a VM/container limiting the resources accessed by another one. Our main contribution is an online, lightweight, and application-agnostic solution for anomaly detection that follows an unsupervised approach. It is based on comparing models for different lags: Dirichlet Process Gaussian Mixture Models to characterize the resource usage profile of the application, and distance measures to score the similarity among models. An alarm is raised when there is an abrupt change in the short-term lag (i.e., a high distance score for short-term models) while the long-term state remains constant. We test the algorithm on different cloud workloads: websites, periodic batch applications, Spark-based applications, and a Memcached server. We are able to detect anomalies in CPU and memory resource usage with up to 82–96% accuracy (recall) depending on the scenario. Compared to other baseline methods, our approach detects anomalies successfully while raising a low number of false positives, even for applications with unusual normal behavior (e.g., periodic ones). Experiments show that our proposed algorithm is a lightweight and effective solution for detecting the noisy neighbor effect without any historical information about the application, and could potentially be applied to other kinds of anomalies as well.
  •  
35.
  • Mehta, Amardeep, 1985- (författare)
  • Resource allocation for Mobile Edge Clouds
  • 2018
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Recent advances in Internet technologies have led to the proliferation of new distributed applications in the transportation, healthcare, mining, security, and entertainment sectors. These emerging applications are bandwidth-hungry and latency-critical, often serve a user population contained within a limited geographical area, and require high availability, low jitter, and security. One way of addressing the challenges arising from these emerging applications is to move computing capabilities closer to the end-users, at the logical edge of a network, in order to improve the performance, operating cost, and reliability of applications and services. These new distributed resources and software stacks, situated on the path between today's centralized data centers and devices in close proximity to the last-mile network, are known as Mobile Edge Clouds (MECs). Distributed MECs provide new opportunities for the management of compute resources and the allocation of applications to those resources, in order to minimize the overall cost of application deployment while satisfying end-user demands in terms of application performance. However, these opportunities also present three significant challenges. The first challenge is where, and how much, computing resources to deploy along the path between today's centralized data centers and devices for cost-optimal operations. The second challenge is where, and how many, resources should be allocated to which applications to meet the applications' performance requirements while minimizing operational costs. The third challenge is how to provide a framework for application deployment on resource-constrained IoT devices in heterogeneous environments. This thesis addresses the above challenges by proposing several models, algorithms, and simulation and software frameworks.
In the first part, we investigate methods for early detection of short-lived, significant increases in demand for computing resources (also called spikes), which may cause significant degradation in the performance of a distributed application. We make use of adaptive signal processing techniques for early detection of spikes, and consider trade-offs between parameters such as the time taken to detect a spike and the number of false spikes detected. In the second part, we study the resource planning problem, examining the cost benefits of adding new compute resources based on the performance requirements of emerging applications. In the third part, we study the problem of allocating resources to applications by formulating it as an optimization problem, where the objective is to minimize overall operational cost while meeting the performance targets of applications. We also propose a hierarchical scheduling framework and policies for allocating resources to applications based on performance metrics of both applications and compute resources. In the last part, we propose a framework, Calvin Constrained, for resource-constrained devices; it extends the Calvin framework and supports a limited but essential subset of the features of the reference framework, taking into account the limited memory and processing power of resource-constrained IoT devices.
  •  
36.
  • Mehta, Amardeep, 1985-, et al. (författare)
  • Utility-based Allocation of Industrial IoT Applications in Mobile Edge Clouds
  • 2018
  • Ingår i: 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC). - Umeå : Umeå universitet. - 9781538668085 - 9781538668078 - 9781538668092
  • Rapport (övrigt vetenskapligt/konstnärligt)abstract
    • Mobile Edge Clouds (MECs) create new opportunities and challenges in terms of scheduling and running applications that have a wide range of latency requirements, such as intelligent transportation systems, process automation, and smart grids. We propose a two-tier scheduler for allocating runtime resources to Industrial Internet of Things (IIoT) applications in MECs. The scheduler at the higher level runs periodically, monitoring system state and the performance of applications, and decides whether to admit new applications and migrate existing ones. The lower-level scheduler, in contrast, decides which application will get the runtime resource next. We use performance-based metrics that tell the extent to which the runtimes are meeting the Service Level Objectives (SLOs) of the hosted applications. The Application Happiness metric is based on a single application's performance and SLOs. The Runtime Happiness metric is based on the Application Happiness of the applications the runtime is hosting. These metrics may be used for decision-making by the scheduler instead of, for example, runtime utilization. We evaluate four scheduling policies for the high-level scheduler and five for the low-level scheduler. The objective of the schedulers is to minimize cost while meeting the SLO of each application. The policies are evaluated with respect to the number of runtimes, the impact on the performance of applications, and the utilization of the runtimes. The results of our evaluation show that the high-level policy based on Runtime Happiness combined with the low-level policy based on Application Happiness outperforms the other policies, including the bin packing and random strategies. In particular, our combined policy requires up to 30% fewer runtimes than the simple bin packing strategy and increases runtime utilization by up to 40% for the Edge Data Center (DC) in the scenarios we evaluated.
  •  
37.
  • Nair, Srijith K., et al. (författare)
  • Towards secure cloud bursting, brokerage and aggregation
  • 2010
  • Ingår i: 2010 Eighth IEEE European Conference on Web Services. - : IEEE. - 9780769543109 ; , s. 189-196
  • Konferensbidrag (refereegranskat)abstract
    • The cloud-based delivery model for IT resources is revolutionizing the IT industry. Despite the marketing hype around “the cloud”, the paradigm itself is in a critical transition state from the laboratories to the mass market. Many technical and business aspects of cloud computing need to mature before it is widely adopted for corporate use. For example, the inability to seamlessly burst between an internal cloud and external cloud platforms, termed cloud bursting, is a significant shortcoming of current cloud solutions. Furthermore, the absence of a capability that would allow brokering between multiple cloud providers, or aggregating them into a composite service, inhibits the free and open competition that would help the market mature. This paper describes the concepts of cloud bursting and cloud brokerage and discusses the open management and security issues associated with the two models. It also presents a possible architectural framework capable of powering brokerage-based cloud services, currently being developed in the scope of OPTIMIS, an EU FP7 project.
  •  
38.
  • Rochwerger, B., et al. (författare)
  • Reservoir : When one cloud is not enough
  • 2011
  • Ingår i: Computer. - : IEEE Computer Society. - 0018-9162 .- 1558-0814. ; 44:3, s. 44-51
  • Tidskriftsartikel (refereegranskat)abstract
    • As cloud computing becomes more predominant, the problem of scalability has become critical for cloud computing providers. The cloud paradigm is attractive because it offers a dramatic reduction in capital and operation expenses for consumers.
  •  
39.
  • Saleh Sedghpour, Mohammad Reza, 1989-, et al. (författare)
  • An Empirical Study of Service Mesh Traffic Management Policies for Microservices
  • 2022
  • Ingår i: ICPE '22: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering. - New York : ACM Digital Library. ; , s. 17-27
  • Konferensbidrag (refereegranskat)abstract
    • A microservice architecture features hundreds or even thousands of small loosely coupled services with multiple instances. Because microservice performance depends on many factors including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and increase the robustness of communication between microservices. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of various tuning parameters for circuit breaking and retries are poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically and also open up new areas of research for academics in the area of service meshes for (autonomic) microservice resource management.
  •  
40.
  • Saleh Sedghpour, Mohammad Reza, 1989-, et al. (författare)
  • Artifact evaluation for distributed systems: current practices and beyond
  • Annan publikation (övrigt vetenskapligt/konstnärligt)abstract
    • Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields have led some scientists to argue for a reproducibility crisis. In response, several high-profile venues within computing have established artifact evaluation tracks, a systematic procedure for evaluating and badging research artifacts, with an increasing number of artifacts submitted. This study compiles recent artifact evaluation procedures and guidelines to show how artifact evaluation in distributed systems research lags behind other computing disciplines, and/or is less unified and more complex. We further argue that current artifact assessment criteria are uncoordinated and insufficient for the unique challenges of distributed systems research. We examine the current state of the practice for artifacts and their evaluation to provide recommendations that assist artifact authors, reviewers, and track chairs. Although our recommendations alone will not resolve the repeatability and reproducibility crisis, we want to start a discussion in our community to increase both the number of submitted artifacts and their quality over time. The ambition of this paper is to provide both artifact authors and reviewers with a one-stop shop for all the knowledge required to make this successful.
  •  
41.
  • Saleh Sedghpour, Mohammad Reza, 1989-, et al. (författare)
  • Breaking the vicious circle : self-adaptive microservice circuit breaking and retry
  • 2023
  • Ingår i: 2023 IEEE international conference on cloud engineering. - : IEEE Computer Society. - 9798350343946 ; , s. 32-42
  • Konferensbidrag (refereegranskat)abstract
    • Microservice-based architectures consist of numerous, loosely coupled services with multiple instances. Service meshes aim to simplify traffic management and prevent microservice overload through circuit breaking and request retry mechanisms. Previous studies have demonstrated that static configuration of these mechanisms is unfit for the dynamic environment of microservices. We conduct a sensitivity analysis to understand the impact of retrying across a wide range of scenarios. Based on the findings, we propose a retry controller that can also work with dynamically configured circuit breakers. We have empirically assessed our proposed controller in various scenarios, including transient overload and noisy neighbors, while enforcing adaptive circuit breaking. The results show that our proposed controller does not deviate from a well-tuned configuration, maintaining carried response time while adapting to changes. Compared to the default static retry configuration mostly used in practice, our approach improves carried throughput by up to 12x and 32x in the cases of transient overload and noisy neighbors, respectively.
  •  
42.
  • Saleh Sedghpour, Mohammad Reza, 1989-, et al. (författare)
  • Hydragen : a microservice benchmark generator
  • 2023
  • Ingår i: 2023 IEEE 16th international conference on cloud computing (CLOUD). - : IEEE. - 9798350304817 - 9798350304824 ; , s. 189-200
  • Konferensbidrag (refereegranskat)abstract
    • Microservice-based architectures have become ubiquitous in large-scale software systems. Experimental cloud researchers constantly propose enhanced resource management mechanisms for such systems. These mechanisms need to be evaluated using both realistic and flexible microservice benchmarks to study in which ways diverse application characteristics can affect their performance and scalability. However, current microservice benchmarks have limitations including static computational complexity, limited architectural scale, and fixed topology (i.e., number of tiers, fan-in, and fan-out characteristics). We therefore propose HydraGen, a tool that enables researchers to systematically generate benchmarks with different computational complexities and topologies, to tackle experimental evaluation of performance at scale for web-serving applications, with a focus on inter-service communication. To illustrate the potential of our open-source tool, we demonstrate how it can reproduce an existing microservice benchmark with preserved architectural properties. We also demonstrate how HydraGen can enrich the evaluation of cloud management systems based on a case study related to traffic engineering.
  •  
43.
  • Saleh Sedghpour, Mohammad Reza, 1989-, et al. (författare)
  • Service mesh circuit breaker: From panic button to performance management tool
  • 2021
  • Ingår i: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450383363 ; , s. 4-10
  • Konferensbidrag (refereegranskat)abstract
    • Site Reliability Engineers are at the center of two tensions: on one hand, they need to respond to alerts within a short time to restore a non-functional system; on the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieves neither, due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments. We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failures, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time on average (including cold starts), with an availability of 70% and 29% of requests circuit-broken. This compares favorably to a static circuit breaker configuration, which features 63% availability, 30% circuit-broken requests, and more than 5% of requests timing out.
  •  
44.
  • Saleh Sedghpour, Mohammad Reza, 1989- (författare)
  • Towards self-driving microservices
  • 2023
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • In recent years, microservice architecture has become a popular method for software system design and development. This involves creating applications with multiple small services, each with multiple instances, operating as independent processes. Due to the distributed nature of microservices, communication between services is a challenging task that becomes increasingly complex as the number of services grows. This complexity can even lead to short-term failures that degrade application performance. Therefore, auto-tuning of inter-service communication is necessary to prevent such failures. Service meshes were introduced to offer the necessary technical capabilities that can be employed in such scenarios. In essence, a service mesh is an infrastructure layer that includes a set of configurable proxies integrated into microservices. This enables the provision of traffic management policies such as circuit breaking and retry mechanisms to enhance microservice resilience against transient failures. However, static configuration or misconfiguration of these mechanisms is unsuitable for the dynamic environment of microservices and can lead to serious issues and performance problems, such as retry storms. The goal of this thesis is three-fold. First, it aims to investigate the impact and effectiveness of service traffic management on application reliability and availability in the presence of transient failures. Second, it focuses on auto-tuning of service traffic management to increase carried throughput and maintain carried response time. Third, this research aims to propose measures that can improve research reproducibility in the area of distributed systems, ensuring that the findings can be independently verified by others.
In this thesis, we aim to offer detailed guidelines on best practices for implementing research software. To achieve these goals, this thesis delves into the current state of the art in service meshes and eBPF-powered microservices, identifying current challenges and potential future directions. It analyzes the effects of circuit breaker and retry mechanisms on microservice performance and proposes adaptive controllers for both. The results show the need for such controllers, which increase throughput while maintaining the tail response time of the application. Additionally, it proposes a microservice benchmark generator to enable systematic microservice benchmark generation and improve reproducibility. It also provides recommendations for improving artifact evaluation in distributed systems research by compiling all existing recommendations.
  •  
45.
  • Souza, Abel, PhLic. 1986-, et al. (författare)
  • A HPC Co-Scheduler with Reinforcement Learning
  • Annan publikation (övrigt vetenskapligt/konstnärligt)abstract
    • High Performance Computing (HPC) datacenters process thousands of diverse applications, supporting many scientific and business endeavours. Although users understand minimum coarse resource job requirements such as amounts of CPUs and memory, internal infrastructural utilization data and system dynamics are often visible only to cluster operators. Moreover, due to increased complexity, heuristically tweaking a batch system is even today a very challenging task. When combined with application profiling, infrastructural data enables improvements to job scheduling, while creating space to improve Quality-of-Service (QoS) metrics such as queue waiting times and total execution times. Targeting improvements in utilization and throughput, in this paper we evaluate and propose a novel Reinforcement Learning co-scheduler algorithm that combines capacity utilization with application performance profiling. We first profile a running application by assessing its resource utilization and progress by means of a forest of decision trees, enabling our algorithm to understand the application's resource capacity usage. We then use this information to estimate how much capacity from this ongoing allocation can be made available for co-scheduling additional applications. Because estimations may go wrong, our algorithm has to learn and evaluate when co-scheduling decisions result in QoS degradation, such as application slowness. To overcome this, we devised a co-scheduling architecture and a supporting metric to help minimize performance degradation, enabling improvements in utilization of up to 25% even when the cluster is experiencing high demand, with 10% average queue makespan reductions under low loads. Together with the architecture, our algorithm forms the base of an application-aware co-scheduler for improved datacenter utilization and minimal performance degradation.
  •  
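The learning loop sketched in the abstract above (profile resource usage, co-schedule extra work, then learn from decisions that degrade QoS) can be illustrated with a minimal tabular sketch. Everything below is an illustrative assumption, not the paper's implementation: an epsilon-greedy rule over bucketed node utilization, with a reward signal that penalizes co-scheduling decisions that caused slowdowns.

```python
import random

class CoScheduler:
    """Tabular sketch: decide whether to co-schedule an extra job onto a
    node, given the node's bucketed CPU utilization as the state."""

    def __init__(self, alpha=0.5, epsilon=0.1):
        self.q = {}          # (state, action) -> estimated reward
        self.alpha = alpha   # learning rate
        self.epsilon = epsilon

    def state(self, cpu_util):
        # Discretize utilization in [0, 1] into 10 buckets.
        return min(int(cpu_util * 10), 9)

    def decide(self, cpu_util):
        s = self.state(cpu_util)
        if random.random() < self.epsilon:
            return random.choice([True, False])  # explore
        # Exploit: co-schedule iff it looks at least as good as not doing so.
        return self.q.get((s, True), 0.0) >= self.q.get((s, False), 0.0)

    def learn(self, cpu_util, co_scheduled, reward):
        # Incremental running-average update toward the observed reward.
        s = self.state(cpu_util)
        key = (s, co_scheduled)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.alpha * (reward - old)
```

Training this on feedback where co-scheduling on lightly loaded nodes pays off and on busy nodes is penalized steers the greedy policy toward co-scheduling only when spare capacity actually exists.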
46.
  • Souza, Abel, PhD, 1986-, et al. (authors)
  • A HPC Co-scheduler with Reinforcement Learning
  • 2021
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). - Cham : Springer. - 9783030882235 - 9783030882242 ; pp. 126-148
  • Conference paper (peer-reviewed), abstract
    • Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and the heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on adaptive reinforcement learning, where application profiling is combined with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., at the operating system level). As opposed to nominal allocations, we apply decision trees to model applications’ actual resource usage, which is used to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions, adapts to changing environment conditions, and evaluates when such changes cause resource contention that impacts quality-of-service metrics such as job slowdowns. We integrate our algorithm in an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation, performed in a dedicated cluster executing a mix of four different real scientific workflows, demonstrates improvements in cluster utilization of up to 51% even in high-load scenarios, with 55% average queue makespan reductions under low loads.
  •  
47.
  • Souza, Abel, PhLic. 1986-, et al. (authors)
  • ASA - The Adaptive Scheduling Architecture
  • 2020
  • In: HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. - New York, NY, USA : ACM Digital Library. - 9781450370523 ; pp. 161-165
  • Conference paper (peer-reviewed), abstract
    • In High Performance Computing (HPC), resources are controlled by batch systems and may not be available due to long queue waiting times, negatively impacting application deadlines. This is noticeable in low-latency scientific workflows, where resource planning and timely allocation are key for efficient processing. On the one hand, peak allocations guarantee the fastest possible workflow execution time, at the cost of extended queue waiting times and costly resource usage. On the other hand, dynamic allocations following specific workflow stage requirements optimize resource usage, though they increase the total workflow makespan. To enable new scheduling strategies and features in workflows, we propose ASA, the Adaptive Scheduling Architecture, a novel scheduling method to reduce perceived queue waiting times as well as to optimize workflow resource usage. Reinforcement learning is used to estimate queue waiting times, and based on these estimates ASA proactively submits resource change requests, minimizing total workflow inter-stage waiting times, idle resources, and makespan. Experiments with three scientific workflows at two HPC centers show that ASA combines the best of the two aforementioned approaches, with average queue waiting time and makespan reductions of up to 10% and 2% respectively, with up to 100% prediction accuracy, while obtaining near-optimal resource utilization.
  •  
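The ASA idea above — estimate queue waiting times, then submit the next stage's resource request early enough that the allocation is ready at the stage boundary — can be sketched with a simple running estimator. The class name, the size-class keying, and the exponential update rule are assumptions for illustration only (the paper uses reinforcement learning for the estimates):

```python
class QueueWaitEstimator:
    """Running estimate of queue waiting time per job size class,
    used to decide how early to submit the next workflow stage."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha       # weight given to the newest observation
        self.estimates = {}      # size class (e.g. node-count bucket) -> seconds

    def observe(self, size_class, waited_seconds):
        # Exponentially weighted update toward the observed waiting time.
        old = self.estimates.get(size_class, waited_seconds)
        self.estimates[size_class] = old + self.alpha * (waited_seconds - old)

    def submit_lead_time(self, size_class):
        # Submit the next stage this many seconds before the current one
        # ends, so the allocation is (ideally) ready at the stage boundary.
        return self.estimates.get(size_class, 0.0)
```

With such an estimator, a workflow driver would request the stage-2 allocation `submit_lead_time(...)` seconds before stage 1 completes, hiding most of the queue wait inside the preceding stage's runtime.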
48.
  • Souza, Abel, PhLic. 1986- (author)
  • Autonomous resource management for high performance datacenters
  • 2020
  • Doctoral thesis (other academic/artistic), abstract
    • Over the last decade, new applications such as data-intensive workflows have reached an inflection point in widespread use and influenced the compute paradigm of most scientific and industrial endeavours. Data-intensive workflows are highly dynamic, adapting to resource changes and system faults, and also admitting approximate solutions into their models. On the one hand, these dynamic characteristics require processing power and capabilities that originated in cloud computing environments, and are not well supported by large High Performance Computing (HPC) infrastructures. On the other hand, cloud computing datacenters favor low latency over throughput, deeply contrasting with HPC, which enforces a centralized environment and prioritizes the total computation accomplished over time, ignoring latency entirely. Although data handling needs are predicted to increase by as much as a thousand times over the next decade, the processing power of future datacenters will not increase as much. To tackle these long-term developments, this thesis proposes autonomic methods combined with novel scheduling strategies to optimize datacenter utilization while guaranteeing user-defined constraints and seamlessly supporting a wide range of applications under various real operational scenarios. Leveraging data-intensive characteristics, a library is developed to dynamically adjust the amount of resources used throughout the lifespan of a workflow, enabling elasticity for such applications in HPC datacenters. For mission-critical environments where services must run even in the event of system failures, we define an adaptive controller to dynamically select the best method to perform runtime state synchronizations. We develop different hybrid extensible architectures and reinforcement learning scheduling algorithms that smoothly bring dynamic applications into HPC environments. An overall theme in this thesis is extensive experimentation in real datacenter environments.
Our results show improvements in datacenter utilization and performance, achieving higher overall efficiency. Our methods also simplify operations and allow the onboarding of novel types of applications previously not supported.
  •  
49.
  • Souza, Abel, 1986-, et al. (authors)
  • Hybrid adaptive checkpointing for virtual machine fault tolerance
  • 2018
  • In: Proceedings - 2018 IEEE International Conference on Cloud Engineering, IC2E 2018. - : Institute of Electrical and Electronics Engineers Inc. - 9781538650080 - 9781538650097 ; pp. 12-22
  • Conference paper (peer-reviewed), abstract
    • Active Virtual Machine (VM) replication is an application-independent and cost-efficient mechanism for high availability and fault tolerance, with several recently proposed implementations based on checkpointing. However, these methods may suffer from large impacts on application latency, excessive resource usage overheads, and/or unpredictable behavior for varying workloads. To address these problems, we propose a hybrid approach that uses a Proportional-Integral (PI) controller to dynamically switch between periodic and on-demand checkpointing. Our mechanism automatically selects the method that minimizes application downtime by adapting itself to changes in workload characteristics. The implementation is based on modifications to QEMU, LibVirt, and OpenStack, to seamlessly provide fault-tolerant VM provisioning and to enable the controller to dynamically select the best checkpointing mode. Our evaluation is based on experiments with a video streaming application, an e-commerce benchmark, and a software development tool. The experiments demonstrate that our adaptive hybrid approach improves both application availability and resource usage compared to static selection of a checkpointing method, with application performance gains and negligible overheads.
  •  
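The hybrid switching described above can be sketched as a PI controller over a downtime error signal that flips the checkpointing mode when the control signal crosses a threshold. The gains, target, threshold, and mode names below are illustrative assumptions, not values from the paper:

```python
class CheckpointModeController:
    """PI-controller sketch: track VM downtime error and switch between
    'periodic' and 'on-demand' checkpointing modes."""

    def __init__(self, kp=0.8, ki=0.2, target_downtime_ms=50.0, threshold=25.0):
        self.kp, self.ki = kp, ki            # proportional and integral gains
        self.target = target_downtime_ms     # acceptable downtime per interval
        self.threshold = threshold           # switch point for the control signal
        self.integral = 0.0
        self.mode = "periodic"

    def update(self, measured_downtime_ms):
        # Classic PI law: signal = Kp * error + Ki * accumulated error.
        error = measured_downtime_ms - self.target
        self.integral += error
        signal = self.kp * error + self.ki * self.integral
        # Sustained downtime above target -> fall back to on-demand mode.
        self.mode = "on-demand" if signal > self.threshold else "periodic"
        return self.mode
```

The integral term is what makes the switch robust to single spikes: one noisy downtime sample barely moves the accumulated error, while a sustained violation drives the signal past the threshold.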
50.
  • Souza, Abel, 1986-, et al. (authors)
  • Hybrid Resource Management for HPC and Data Intensive Workloads
  • 2019
  • In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). - Los Alamitos : IEEE Computer Society. - 9781728109138 - 9781728109121 ; pp. 399-409
  • Conference paper (peer-reviewed), abstract
    • Traditionally, High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate hardware using different tools for resource and application management. With the increasing convergence of these paradigms, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead and the need for a common computation platform for both application areas increases. Executing both application classes on the same hardware not only enables hybrid workflows, but can also increase the usage efficiency of the system, as often not all available hardware is fully utilized by an application. While HPC systems are typically managed in a coarse-grained fashion, allocating a fixed set of resources exclusively to an application, DI systems employ a finer-grained regime, enabling dynamic resource allocation and control based on application needs. On the path to full convergence, a useful and less intrusive step is a hybrid resource management system that allows the execution of DI applications on top of standard HPC scheduling systems. In this paper we present the architecture of a hybrid system enabling dual-level scheduling for DI jobs in HPC infrastructures. Our system takes advantage of real-time resource utilization monitoring to efficiently co-schedule HPC and DI applications. The architecture is easily adaptable and extensible to current and new types of distributed workloads, allowing the efficient combination of hybrid workloads on HPC resources with increased job throughput and higher overall resource utilization. The architecture is implemented based on the Slurm and Mesos resource managers for HPC and DI jobs. Our experimental evaluation in a real cluster, based on a set of representative HPC and DI applications, demonstrates that our hybrid architecture improves resource utilization by 20%, with a 12% decrease in queue makespan while still meeting all deadlines for HPC jobs.
  •  
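The dual-level co-scheduling described above — using real-time utilization monitoring to place DI jobs on under-used HPC nodes — can be sketched as a greedy backfill step. The function name, the headroom parameter, and the data shapes are illustrative assumptions, not the Slurm/Mesos implementation:

```python
def backfill_di_jobs(nodes, di_jobs, headroom=0.10):
    """Greedy sketch: place DI jobs onto HPC nodes whose monitored CPU
    usage leaves enough idle capacity, keeping a safety headroom for the
    primary HPC allocation.

    nodes:   {node_name: monitored_cpu_fraction_in_use}
    di_jobs: [(job_name, cpu_fraction_needed), ...]
    Returns {job_name: node_name} for the jobs that fit.
    """
    placements = {}
    # Capacity usable for DI work: whatever the HPC job leaves idle,
    # minus the reserved headroom.
    free = {n: max(0.0, 1.0 - used - headroom) for n, used in nodes.items()}
    for job, need in sorted(di_jobs, key=lambda j: -j[1]):  # largest first
        candidates = [n for n, f in free.items() if f >= need]
        if not candidates:
            continue  # job stays in the DI queue
        best = max(candidates, key=lambda n: free[n])       # most free capacity
        placements[job] = best
        free[best] -= need
    return placements
```

The headroom is the knob that trades utilization against interference: a larger reserve protects HPC job deadlines at the cost of leaving more capacity unused for DI backfill.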