SwePub
Search the SwePub database


Result list for search "WFRF:(Stadler Rolf Prof.)"

Search: WFRF:(Stadler Rolf Prof.)

  • Result 1-29 of 29
1.
  • Yanggratoke, Rerngvit, 1983- (author)
  • Data-driven Performance Prediction and Resource Allocation for Cloud Services
  • 2016
  • Doctoral thesis (other academic/artistic), abstract:
    • Cloud services, which provide online entertainment, enterprise resource management, tax filing, etc., are becoming essential for consumers, businesses, and governments. The key functionalities of such services are provided by backend systems in data centers. This thesis focuses on three fundamental problems related to management of backend systems. We address these problems using data-driven approaches: triggering dynamic allocation by changes in the environment, obtaining configuration parameters from measurements, and learning from observations. The first problem relates to resource allocation for large clouds with potentially hundreds of thousands of machines and services. We developed and evaluated a generic gossip protocol for distributed resource allocation. Extensive simulation studies suggest that the quality of the allocation is independent of the system size for the management objectives considered. The second problem focuses on performance modeling of a distributed key-value store, and we study specifically the Spotify backend for streaming music. We developed analytical models for system capacity under different data allocation policies and for response time distribution. We evaluated the models by comparing model predictions with measurements from our lab testbed and from the Spotify operational environment. We found the prediction error to be below 12% for all investigated scenarios. The third problem relates to real-time prediction of service metrics, which we address through statistical learning. Service metrics are learned from observing device and network statistics. We performed experiments on a server cluster running video streaming and key-value store services. We showed that feature set reduction significantly improves the prediction accuracy, while simultaneously reducing model computation time. Finally, we designed and implemented a real-time analytics engine, which produces model predictions through online learning.
  •  
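A minimal sketch of the gossip idea behind the resource-allocation work in record 1, assuming a simple load-balancing objective; the pairwise-averaging rule and all names here are illustrative stand-ins, not the protocol from the thesis:

import random

def gossip_round(loads, pairs_per_round):
    """One gossip round: randomly chosen node pairs equalize their load.
    Pairwise averaging stands in for the thesis's generic,
    objective-driven exchange step (illustrative only)."""
    n = len(loads)
    for _ in range(pairs_per_round):
        i, j = random.sample(range(n), 2)
        loads[i] = loads[j] = (loads[i] + loads[j]) / 2.0

loads = [random.uniform(0, 100) for _ in range(10_000)]
for _ in range(30):
    gossip_round(loads, pairs_per_round=len(loads) // 2)
print(max(loads) - min(loads))  # spread shrinks round by round, regardless of n

The number of rounds needed for a given allocation quality does not grow with the number of nodes, which is the qualitative scaling behavior the abstract reports.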
2.
  • Gonzalez Prieto, Alberto, 1977- (author)
  • Adaptive Real-time Monitoring for Large-scale Networked Systems
  • 2008
  • Doctoral thesis (other academic/artistic), abstract:
    • Large-scale networked systems, such as the Internet and server clusters, are omnipresent today. They increasingly deliver services that are critical to both businesses and the society at large, and therefore their continuous and correct operation must be guaranteed. Achieving this requires the realization of adaptive management systems, which continuously reconfigure such large-scale dynamic systems, in order to maintain their state near a desired operating point, despite changes in the networking conditions. The focus of this thesis is continuous real-time monitoring, which is essential for the realization of adaptive management systems in large-scale dynamic environments. Real-time monitoring provides the necessary input to the decision-making process of network management, enabling management systems to perform self-configuration and self-healing tasks. We have developed, implemented, and evaluated a design for real-time continuous monitoring of global metrics with performance objectives, such as monitoring overhead and estimation accuracy. Global metrics describe the state of the system as a whole, in contrast to local metrics, such as device counters or local protocol states, which capture the state of a local entity. Global metrics are computed from local metrics using aggregation functions, such as SUM, AVERAGE and MAX. Our approach is based on in-network aggregation, where global metrics are incrementally computed using spanning trees. Performance objectives are achieved through filtering updates to local metrics that are sent along that tree. A key part of the design is a model for the distributed monitoring process that relates performance metrics to parameters that tune the behavior of a monitoring protocol. The model allows us to describe the behavior of individual nodes in the spanning tree in their steady state. The model has been instrumental in designing a monitoring protocol that is controllable and achieves given performance objectives. We have evaluated our protocol, called A-GAP, experimentally, through simulation and testbed implementation. It has proved to be effective in meeting performance objectives, efficient, adaptive to changes in the networking conditions, controllable along different performance dimensions, and scalable. We have implemented a prototype on a testbed of commercial routers. The testbed measurements are consistent with simulation studies we performed for different topologies and network sizes. This proves the feasibility of the design, and, more generally, the feasibility of effective and efficient real-time monitoring in large network environments.
  •  
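A minimal sketch of the in-network aggregation idea from record 2: partial aggregates flow up a spanning tree, and a node reports upward only when its aggregate has moved by more than its filter width. The class below is an illustrative skeleton, not A-GAP itself, whose filter widths are computed from a performance model:

class Node:
    """Spanning-tree node that aggregates a SUM metric from its subtree and
    forwards an update only when the change exceeds its filter width."""
    def __init__(self, filter_width, parent=None):
        self.filter_width = filter_width
        self.parent = parent
        self.local = 0.0          # local metric (e.g., a device counter)
        self.children_sums = {}   # last reported partial sum per child
        self.last_reported = 0.0

    def aggregate(self):
        return self.local + sum(self.children_sums.values())

    def update_local(self, value):
        self.local = value
        self.maybe_report()

    def maybe_report(self):
        agg = self.aggregate()
        if abs(agg - self.last_reported) >= self.filter_width:
            self.last_reported = agg
            if self.parent is not None:
                self.parent.children_sums[id(self)] = agg
                self.parent.maybe_report()

root = Node(filter_width=0.0)
leaves = [Node(filter_width=5.0, parent=root) for _ in range(100)]
for i, leaf in enumerate(leaves):
    leaf.update_local(float(i % 10))
print(root.aggregate())  # estimate of the global SUM, within per-leaf filter error

Wider filters suppress more updates (lower overhead) at the price of estimation error; that overhead/accuracy trade-off is exactly what the protocol's tuning model controls.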
3.
  • Ahmed, J., et al. (author)
  • Automated diagnostic of virtualized service performance degradation
  • 2018
  • In: Proceedings 2018 IEEE/IFIP Network Operations and Management Symposium, NOMS 2018. - New York : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 1-9
  • Conference paper (peer-reviewed), abstract:
    • Service assurance for cloud applications is a challenging task and is an active area of research for academia and industry. One promising approach is to utilize machine learning for service quality prediction and fault detection so that suitable mitigation actions can be executed. In our previous work, we have shown how to predict service-level metrics in real-time just from operational data gathered at the server side. This gives the service provider early indications of whether the platform can support the current load demand. This paper provides the logical next step, where we extend our work by proposing an automated detection and diagnostic capability for the performance faults manifesting themselves in cloud and datacenter environments. This is crucial for maintaining the smooth operation of running services and minimizing downtime. We demonstrate the effectiveness of our approach, which exploits the interpretative capabilities of Self-Organizing Maps (SOMs) to automatically detect and localize different performance faults for cloud services.
  •  
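A minimal sketch of SOM-based fault detection in the spirit of record 3, assuming the third-party minisom package; training on fault-free metrics and flagging samples with a high quantization error is a common SOM anomaly-detection recipe, not necessarily the paper's exact diagnostic procedure:

import numpy as np
from minisom import MiniSom  # assumption: the minisom package is installed

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(1000, 8))   # stand-in for server-side metrics
faulty = rng.normal(4.0, 1.0, size=(20, 8))     # stand-in for a performance fault

som = MiniSom(10, 10, input_len=8, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(normal, num_iteration=5000)    # train on fault-free operation only

def quantization_errors(som, data):
    """Distance of each sample to its best-matching SOM unit."""
    return np.linalg.norm(data - som.quantization(data), axis=1)

threshold = np.quantile(quantization_errors(som, normal), 0.99)
alarms = quantization_errors(som, faulty) > threshold
print(f"flagged {alarms.sum()} of {len(faulty)} faulty samples")

Localization can then proceed by inspecting which map units (and which input dimensions) the anomalous samples activate, which is where the SOM's interpretability helps.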
4.
  • Burgess, Mark, et al. (author)
  • Network patterns in cfengine and scalable data aggregation
  • 2007
  • In: Proceedings of the 21st Large Installation System Administration Conference. - : USENIX Association. ; , s. 275-
  • Conference paper (peer-reviewed), abstract:
    • Network patterns are based on generic algorithms that execute on tree-based overlays. A set of such patterns has been developed at KTH to support distributed monitoring in networks with non-trivial topologies. We consider the use of this approach in logical peer networks in cfengine as a way of scaling aggregation of data to large organizations. Use of 'deep' network structures can lead to temporal anomalies. We show how to minimize temporal fragmentation during data aggregation by using time offsets and what effect these choices might have on power consumption. We offer proof of concept for this technology to initiate either multicast or inverse multicast pulses through sensor networks.
  •  
5.
  • Chemouil, Prosper, et al. (author)
  • Special Issue on Advances in Artificial Intelligence and Machine Learning for Networking
  • 2020
  • In: IEEE Journal on Selected Areas in Communications. - : Institute of Electrical and Electronics Engineers (IEEE). - 0733-8716 .- 1558-0008. ; 38:10, s. 2229-2233
  • Journal article (other academic/artistic), abstract:
    • Artificial Intelligence (AI) and Machine Learning (ML) approaches have emerged in the networking domain with great expectation. They can be broadly divided into AI/ML techniques for network engineering and management, network designs for AI/ML applications, and system concepts. AI/ML techniques for networking and management improve the way we address networking. They support efficient, rapid, and trustworthy engineering, operations, and management. As such, they meet the current interest in softwarization and network programmability that fuels the need for improved network automation in agile infrastructures, including edge and fog environments. Network design and optimization for AI/ML applications addresses the complementary topic of supporting AI/ML-based systems through novel networking techniques, including new architectures and algorithms. The third topic area is system implementation and open-source software development.
  •  
6.
  •  
7.
  • Di Fatta, G., et al. (author)
  • Preface
  • 2011
  • In: IEEE International Conference on Data Mining. Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE). - 1550-4786. ; , s. xlviii-xlix
  • Journal article (peer-reviewed)
  •  
8.
  • Hammar, Kim, et al. (author)
  • A System for Interactive Examination of Learned Security Policies
  • 2022
  • In: Proceedings of the IEEE/IFIP Network Operations and Management Symposium 2022. - : IEEE.
  • Conference paper (peer-reviewed), abstract:
    • We present a system for interactive examination of learned security policies. It allows a user to traverse episodes of Markov decision processes in a controlled manner and to track the actions triggered by security policies. Similar to a software debugger, a user can continue or halt an episode at any time step and inspect parameters and probability distributions of interest. The system enables insight into the structure of a given policy and into the behavior of a policy in edge cases. We demonstrate the system with a network intrusion use case. We examine the evolution of an IT infrastructure's state and the actions prescribed by security policies while an attack occurs. The policies for the demonstration have been obtained through a reinforcement learning approach that includes a simulation system where policies are incrementally learned and an emulation system that produces statistics that drive the simulation runs.
  •  
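A toy illustration of the debugger-style stepping that record 8 describes; the environment, the threshold policy, and the breakpoint mechanics below are invented stand-ins for the real system:

import random

class ToyIntrusionEnv:
    """Toy two-action MDP standing in for the network-intrusion use case."""
    def reset(self):
        self.t, self.compromised = 0, 0
        return self.compromised
    def step(self, action):
        if action == "stop":                       # defensive action ends the episode
            return self.compromised, True
        self.compromised += random.random() < 0.3  # attack progresses
        self.t += 1
        return self.compromised, self.t >= 10

class ThresholdPolicy:
    """Stops once the observed number of compromised nodes reaches k."""
    def __init__(self, k): self.k = k
    def action_distribution(self, state):
        p_stop = 1.0 if state >= self.k else 0.0
        return {"stop": p_stop, "continue": 1.0 - p_stop}

def examine(env, policy, breakpoints):
    state, t, done = env.reset(), 0, False
    while not done:
        dist = policy.action_distribution(state)
        action = max(dist, key=dist.get)
        if t in breakpoints:                       # halt and inspect, debugger-style
            print(f"t={t} state={state} action={action} dist={dist}")
        state, done = env.step(action)
        t += 1

examine(ToyIntrusionEnv(), ThresholdPolicy(k=2), breakpoints={0, 3, 6})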
9.
  • Hammar, Kim, et al. (author)
  • An Online Framework for Adapting Security Policies in Dynamic IT Environments
  • 2022
  • In: 2022 18th International Conference on Network and Service Management (CNSM 2022). - : IEEE.
  • Conference paper (peer-reviewed), abstract:
    • We present an online framework for learning and updating security policies in dynamic IT environments. It includes three components: a digital twin of the target system, which continuously collects data and evaluates learned policies; a system identification process, which periodically estimates system models based on the collected data; and a policy learning process that is based on reinforcement learning. To evaluate our framework, we apply it to an intrusion prevention use case that involves a dynamic IT infrastructure. Our results demonstrate that the framework automatically adapts security policies to changes in the IT infrastructure and that it outperforms a state-of-the-art method.
  •  
10.
  • Hammar, Kim, et al. (author)
  • Digital Twins for Security Automation
  • 2023
  • In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2023, NOMS 2023. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • We present a novel emulation system for creating high-fidelity digital twins of IT infrastructures. The digital twins replicate key functionality of the corresponding infrastructures and make it possible to play out security scenarios in a safe environment. We show that this capability can be used to automate the process of finding effective security policies for a target infrastructure. In our approach, a digital twin of the target infrastructure is used to run security scenarios and collect data. The collected data is then used to instantiate simulations of Markov decision processes and learn effective policies through reinforcement learning, whose performance is validated in the digital twin. This closed-loop learning process executes iteratively and provides continuously evolving and improving security policies. We apply our approach to an intrusion response scenario. Our results show that the digital twin provides the necessary evaluative feedback to learn near-optimal intrusion response policies.
  •  
11.
  • Hammar, Kim, et al. (author)
  • Finding Effective Security Strategies through Reinforcement Learning and Self-Play
  • 2020
  • In: 2020 16th International Conference on Network and Service Management (CNSM). - : IEEE.
  • Conference paper (peer-reviewed), abstract:
    • We present a method to automatically find security strategies for the use case of intrusion prevention. Following this method, we model the interaction between an attacker and a defender as a Markov game and let attack and defense strategies evolve through reinforcement learning and self-play without human intervention. Using a simple infrastructure configuration, we demonstrate that effective security strategies can emerge from self-play. This shows that self-play, which has been applied in other domains with great success, can be effective in the context of network security. Inspection of the converged policies shows that they reflect common-sense knowledge and are similar to strategies of humans. Moreover, we address known challenges of reinforcement learning in this domain and present an approach that uses function approximation, an opponent pool, and an autoregressive policy representation. Through evaluations we show that our method is superior to two baseline methods but that policy convergence in self-play remains a challenge.
  •  
12.
  • Hammar, Kim, et al. (author)
  • Intrusion Prevention Through Optimal Stopping
  • 2022
  • In: IEEE Transactions on Network and Service Management. - : Institute of Electrical and Electronics Engineers (IEEE). - 1932-4537. ; 19:3, s. 2333-2348
  • Journal article (peer-reviewed), abstract:
    • We study automated intrusion prevention using reinforcement learning. Following a novel approach, we formulate the problem of intrusion prevention as an (optimal) multiple stopping problem. This formulation gives us insight into the structure of optimal policies, which we show to have threshold properties. For most practical cases, it is not feasible to obtain an optimal defender policy using dynamic programming. We therefore develop a reinforcement learning approach to approximate an optimal threshold policy. We introduce T-SPSA, an efficient reinforcement learning algorithm that learns threshold policies through stochastic approximation. We show that T-SPSA outperforms state-of-the-art algorithms for our use case. Our overall method for learning and validating policies includes two systems: a simulation system where defender policies are incrementally learned and an emulation system where statistics are produced that drive simulation runs and where learned policies are evaluated. We show that this approach can produce effective defender policies for a practical IT infrastructure.
  •  
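A minimal sketch of learning a stopping threshold with SPSA, the stochastic-approximation technique underlying T-SPSA in record 12. The toy episode model, the single threshold, and the gain constants are illustrative assumptions; T-SPSA learns a vector of thresholds for the multiple stopping formulation:

import numpy as np

rng = np.random.default_rng(1)

def episode_return(theta, horizon=50):
    """Toy stopping episode: an alert level starts drifting upward when an
    intrusion begins; the reward favors stopping near the intrusion time.
    Purely illustrative, not the MDP from the paper."""
    start = rng.integers(5, 30)
    level = 0.0
    for t in range(horizon):
        drift = 0.5 if t >= start else 0.0
        level += rng.normal(drift, 1.0)
        if level >= theta:                        # threshold policy: stop here
            return -abs(t - start)
    return -float(horizon - start)                # never stopped: late by construction

def spsa(theta=5.0, iters=500, a=0.5, c=1.0):
    """Simultaneous perturbation stochastic approximation on the threshold."""
    for k in range(1, iters + 1):
        ak, ck = a / k ** 0.602, c / k ** 0.101   # standard SPSA gain decay
        delta = rng.choice([-1.0, 1.0])
        grad = (episode_return(theta + ck * delta) -
                episode_return(theta - ck * delta)) / (2 * ck * delta)
        theta += ak * grad                        # ascend the estimated return
    return theta

print(f"learned threshold: {spsa():.2f}")

Only two episode evaluations per iteration are needed regardless of how many thresholds are learned, which is what makes SPSA attractive here.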
13.
  • Hammar, Kim, et al. (author)
  • Learning Intrusion Prevention Policies through Optimal Stopping
  • 2021
  • In: Proceedings of the 2021 17th International Conference on Network and Service Management (CNSM 2021). - : IEEE. ; , s. 509-517
  • Conference paper (peer-reviewed), abstract:
    • We study automated intrusion prevention using reinforcement learning. In a novel approach, we formulate the problem of intrusion prevention as an optimal stopping problem. This formulation gives us insight into the structure of the optimal policies, which turn out to be threshold-based. Since the computation of the optimal defender policy using dynamic programming is not feasible for practical cases, we approximate the optimal policy through reinforcement learning in a simulation environment. To define the dynamics of the simulation, we emulate the target infrastructure and collect measurements. Our evaluations show that the learned policies are close to optimal and that they indeed can be expressed using thresholds.
  •  
14.
  • Hammar, Kim, et al. (author)
  • Learning Near-Optimal Intrusion Responses Against Dynamic Attackers
  • 2024
  • In: IEEE Transactions on Network and Service Management. - : Institute of Electrical and Electronics Engineers (IEEE). - 1932-4537. ; 21:1, s. 1158-1177
  • Journal article (peer-reviewed), abstract:
    • We study automated intrusion response and formulate the interaction between an attacker and a defender as an optimal stopping game where attack and defense strategies evolve through reinforcement learning and self-play. The game-theoretic modeling enables us to find defender strategies that are effective against a dynamic attacker, i.e., an attacker that adapts its strategy in response to the defender strategy. Further, the optimal stopping formulation allows us to prove that best response strategies have threshold properties. To obtain near-optimal defender strategies, we develop Threshold Fictitious Self-Play (T-FP), a fictitious self-play algorithm that learns Nash equilibria through stochastic approximation. We show that T-FP outperforms a state-of-the-art algorithm for our use case. The experimental part of this investigation includes two systems: a simulation system where defender strategies are incrementally learned and an emulation system where statistics are collected that drive simulation runs and where learned strategies are evaluated. We argue that this approach can produce effective defender strategies for a practical IT infrastructure.
  •  
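A minimal sketch of fictitious play, the equilibrium-learning scheme that T-FP in record 14 builds on: each player repeatedly best-responds to the opponent's empirical strategy, and in zero-sum games the empirical averages converge to a Nash equilibrium. The 2x2 payoff matrix is an invented toy, not the stopping game from the paper:

import numpy as np

# Payoff to the defender in a toy zero-sum "attack vs. defend" game
# (illustrative numbers only).
A = np.array([[ 1.0, -1.0],
              [-0.5,  0.5]])

def fictitious_play(A, iters=10_000):
    """Each player best-responds to the opponent's empirical mixture;
    the empirical averages approach a Nash equilibrium."""
    d_counts = np.ones(A.shape[0])   # defender action counts
    a_counts = np.ones(A.shape[1])   # attacker action counts
    for _ in range(iters):
        d_br = np.argmax(A @ (a_counts / a_counts.sum()))  # defender best response
        a_br = np.argmin((d_counts / d_counts.sum()) @ A)  # attacker best response
        d_counts[d_br] += 1
        a_counts[a_br] += 1
    return d_counts / d_counts.sum(), a_counts / a_counts.sum()

d, a = fictitious_play(A)
print("defender mix:", d.round(3), "attacker mix:", a.round(3))

In T-FP the exact best responses are replaced by threshold best responses learned through stochastic approximation, which is what makes the scheme tractable for the stopping game.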
15.
  • Hammar, Kim, et al. (author)
  • Scalable Learning of Intrusion Response Through Recursive Decomposition
  • 2023
  • In: Decision and Game Theory for Security - 14th International Conference, GameSec 2023, Proceedings. - : Springer Nature. ; , s. 172-192
  • Conference paper (peer-reviewed), abstract:
    • We study automated intrusion response for an IT infrastructure and formulate the interaction between an attacker and a defender as a partially observed stochastic game. To solve the game we follow an approach where attack and defense strategies co-evolve through reinforcement learning and self-play toward an equilibrium. Solutions proposed in previous work prove the feasibility of this approach for small infrastructures but do not scale to realistic scenarios due to the exponential growth in computational complexity with the infrastructure size. We address this problem by introducing a method that recursively decomposes the game into subgames with low computational complexity, which can be solved in parallel. Applying optimal stopping theory, we show that the best response strategies in these subgames exhibit threshold structures, which allows us to compute them efficiently. To solve the decomposed game we introduce an algorithm called Decompositional Fictitious Self-Play (dfsp), which learns Nash equilibria through stochastic approximation. We evaluate the learned strategies in an emulation environment where real intrusions and response actions can be executed. The results show that the learned strategies approximate an equilibrium and that dfsp significantly outperforms a state-of-the-art algorithm for a realistic infrastructure configuration.
  •  
16.
  • Jennings, Brendan, et al. (author)
  • Resource Management in Clouds : Survey and Research Challenges
  • 2015
  • In: Journal of Network and Systems Management. - : Springer Science and Business Media LLC. - 1064-7570 .- 1573-7705. ; 23:3, s. 567-619
  • Journal article (peer-reviewed), abstract:
    • Resource management in a cloud environment is a hard problem, due to: the scale of modern data centers; the heterogeneity of resource types and their interdependencies; the variability and unpredictability of the load; as well as the range of objectives of the different actors in a cloud ecosystem. Consequently, both academia and industry began significant research efforts in this area. In this paper, we survey the recent literature, covering 250+ publications, and highlighting key results. We outline a conceptual framework for cloud resource management and use it to structure the state-of-the-art review. Based on our analysis, we identify five challenges for future investigation. These relate to: providing predictable performance for cloud-hosted applications; achieving global manageability for cloud systems; engineering scalable resource management systems; understanding economic behavior and cloud pricing; and developing solutions for the mobile cloud paradigm.
  •  
17.
  • Samani, Forough Shahab, et al. (author)
  • A Framework for dynamically meeting performance objectives on a service mesh
  • 2024
  • Other publication (other academic/artistic), abstract:
    • We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. It is novel in that it advocates a top-down approach whereby the management objective is defined first and then mapped onto the available control actions. Several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates learning of the system model and the operating region from learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives in parallel. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% less system resources than Kubernetes HPA autoscaling.
  •  
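A minimal illustration of the top-down step that record 17 describes: a management objective (here, an end-to-end delay bound plus a resource-cost term) is expressed as a reward function that the RL agent then maximizes. The function shape, bound, and weight are invented for illustration:

def reward(delay_ms, cpu_alloc, delay_bound_ms=200.0, cost_weight=0.1):
    """Reward for one control interval: meet the end-to-end delay bound,
    then economize on allocated resources (illustrative shape only)."""
    sla = 1.0 if delay_ms <= delay_bound_ms else -1.0
    return sla - cost_weight * cpu_alloc

# The agent's control actions (e.g., scaling, routing, blocking) are chosen
# to maximize the discounted sum of such rewards.
print(reward(delay_ms=150.0, cpu_alloc=4.0))   #  0.6: bound met, moderate cost
print(reward(delay_ms=250.0, cpu_alloc=4.0))   # -1.4: bound violated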
18.
  • Samani, Forough Shahab, et al. (author)
  • Conditional Density Estimation of Service Metrics for Networked Services
  • 2021
  • In: IEEE Transactions on Network and Service Management. - : Institute of Electrical and Electronics Engineers (IEEE). - 1932-4537. ; 18:2, s. 2350-2364
  • Journal article (peer-reviewed), abstract:
    • We predict the conditional distributions of service metrics, such as response time or frame rate, from infrastructure measurements in a networked environment. From such distributions, key statistics of the service metrics, including mean, variance, or quantiles, can be computed, which are essential for predicting SLA conformance and enabling service assurance. We present and assess two methods for prediction: (1) mixture models with Gaussian or Lognormal kernels, whose parameters are estimated using mixture density networks, a class of neural networks, and (2) histogram models, which require the target space to be discretized. We apply these methods to a VoD service and a KV store service running on our lab testbed. A comparative evaluation shows the relative effectiveness of the methods when applied to operational data. We find that both methods allow for accurate prediction. While mixture models provide a general and elegant solution, they incur a very high overhead related to hyper-parameter search and neural network training. Histogram models, on the other hand, allow for efficient training, but require adjustment to the specific use case.
  •  
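A minimal sketch of the histogram method from record 18: discretize the service metric into bins and train a classifier whose class probabilities form the conditional density; the mean or quantiles then follow from the bin probabilities. The synthetic data and the random-forest classifier are illustrative choices, not the paper's setup:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5000, 4))                  # infrastructure metrics
y = 50 + 100 * X[:, 0] + rng.normal(0, 10, size=5000)  # service response time (ms)

edges = np.linspace(y.min(), y.max(), 21)              # 20 histogram bins
y_bin = np.clip(np.digitize(y, edges) - 1, 0, 19)      # bin index per sample

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_bin)

x_new = np.array([[0.8, 0.5, 0.5, 0.5]])
density = clf.predict_proba(x_new)[0]                  # P(bin | x): conditional histogram
centers = (edges[:-1] + edges[1:]) / 2
mean = centers[clf.classes_] @ density                 # e.g., the conditional mean
print(f"predicted mean response time: {mean:.1f} ms")

The "adjustment to the specific use case" the abstract mentions corresponds here to choosing the bin edges and bin count for the target metric.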
19.
  • Samani, Forough Shahab, et al. (author)
  • Demonstrating a System for Dynamically Meeting Management Objectives on a Service Mesh
  • 2023
  • In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2023, NOMS 2023. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed), abstract:
    • We demonstrate a management system that lets a service provider achieve end-to-end management objectives under varying load for applications on a service mesh based on the Istio and Kubernetes platforms. The management objectives for the demonstration include end-to-end delay bounds on service requests, throughput objectives, and service differentiation. Our method for finding effective control policies includes a simulator and a control module. The simulator is instantiated with traces from a testbed, and the control module trains a reinforcement learning (RL) agent to efficiently learn effective control policies on the simulator. The learned policies are then transferred to the testbed to perform dynamic control actions based on monitored system metrics. We show that the learned policies dynamically meet management objectives on the testbed and can be changed on the fly.
  •  
20.
  • Samani, Forough Shahab, et al. (author)
  • Dynamically meeting performance objectives for multiple services on a service mesh
  • 2022
  • In: 2022 18th International Conference on Network and Service Management (CNSM 2022). - : IEEE. ; , s. 219-225
  • Conference paper (peer-reviewed), abstract:
    • We present a framework that lets a service provider achieve end-to-end management objectives under varying load. Dynamic control actions are performed by a reinforcement learning (RL) agent. Our work includes experimentation and evaluation on a laboratory testbed where we have implemented basic information services on a service mesh supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, and service differentiation. These objectives are mapped onto reward functions that an RL agent learns to optimize, by executing control actions, namely, request routing and request blocking. We compute the control policies not on the testbed, but in a simulator, which speeds up the learning process by orders of magnitude. In our approach, the system model is learned on the testbed; it is then used to instantiate the simulator, which produces near-optimal control policies for various management objectives. The learned policies are then evaluated on the testbed using unseen load patterns.
  •  
21.
  • Santos, A., et al. (author)
  • Message from dissertation digest chairs
  • 2012
  • In: Proceedings of the 2012 IEEE Network Operations and Management Symposium. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed)
  •  
22.
  • Shahabsamani, Forough, 1988- (author)
  • End-to-end performance prediction and automated resource management of cloud services
  • 2024
  • Doctoral thesis (other academic/artistic), abstract:
    • Cloud-based services are integral to modern life. Cloud systems aim to provide customers with uninterrupted services of high quality while enabling cost-effective fulfillment by providers. The key to meeting quality requirements and end-to-end performance objectives is to devise effective strategies to allocate resources to the services. This in turn requires automation of resource allocation. Recently, researchers have studied learning-based approaches, especially reinforcement learning (RL), for automated resource allocation. These approaches are particularly promising for resource allocation in cloud systems because they can deal with the architectural complexity of a cloud environment. Previous research shows that reinforcement learning is effective for specific types of controls, such as horizontal or vertical scaling of compute resources. However, major obstacles to operational deployment remain. Chief among them is the fact that reinforcement learning methods require long times for training and retraining after system changes. With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation using reinforcement learning on a testbed. On the conceptual level, we address two interconnected problems: predicting end-to-end service performance and automated resource allocation for cloud services. First, we study methods to predict the conditional density of service metrics and demonstrate the effectiveness of employing dimensionality reduction methods to reduce monitoring, communication, and model-training overhead. For automated resource allocation, we develop a framework for RL-based control. Our approach involves learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that using our framework, we can effectively achieve end-to-end performance objectives by dynamically allocating resources to the services using different types of control actions simultaneously.
  •  
23.
  • Shahabsamani, Forough, et al. (author)
  • Online Policy Adaptation for Networked Systems using Rollout
  • 2024
  • Conference paper (peer-reviewed), abstract:
    • Dynamic resource allocation in networked systems is needed to continuously achieve end-to-end management objectives. Recent research has shown that reinforcement learning can achieve near-optimal resource allocation policies for realistic system configurations. However, most current solutions require expensive retraining when changes in the system occur. We address this problem and introduce an efficient method to adapt a given base policy to system changes, e.g., to a change in the service offering. In our approach, we adapt a base control policy using a rollout mechanism, which transforms the base policy into an improved rollout policy. We perform extensive evaluations on a testbed where we run applications on a service mesh based on the Istio and Kubernetes platforms. The experiments provide insights into the performance of different rollout algorithms. We find that our approach produces policies that are as effective as those obtained by offline retraining. On our testbed, effective policy adaptation takes seconds when using rollout, compared to minutes or hours when using retraining. Our work demonstrates that rollout, which has been applied successfully in other domains, is an effective approach for policy adaptation in networked systems.
  •  
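A minimal sketch of the rollout mechanism from record 23: for each candidate action, simulate forward in a (learned) system model while following the base policy, then pick the action with the best average return. The toy queue model and the crude base policy are invented for illustration:

import random

class ToyQueueModel:
    """Stand-in for a learned system model: one service queue; the action is
    the number of replicas; reward trades off backlog against resource cost."""
    def step(self, backlog, replicas):
        backlog = max(0.0, backlog + random.uniform(2, 6) - 3.0 * replicas)
        reward = -backlog - 0.5 * replicas
        return backlog, reward

def base_policy(backlog):
    return 1 if backlog < 10 else 2        # a crude pre-trained policy

def rollout_action(state, actions, model, horizon=10, samples=20):
    """One-step lookahead: try each first action, then follow the base
    policy inside the model; pick the action with the best average return."""
    def value(first_action):
        total = 0.0
        for _ in range(samples):
            s, a = state, first_action
            for _ in range(horizon):
                s, r = model.step(s, a)
                total += r
                a = base_policy(s)
        return total / samples
    return max(actions, key=value)

print(rollout_action(state=25.0, actions=[1, 2, 3, 4], model=ToyQueueModel()))

Because the lookahead runs in the model rather than on the live system, adaptation takes only as long as the simulations, which is the source of the seconds-versus-hours gap the abstract reports.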
24.
  • Shahabsamani, Forough Shahab, et al. (author)
  • Comparing Transfer Learning and Rollout for Policy Adaptation in a Changing Network Environment
  • 2024
  • Conference paper (peer-reviewed), abstract:
    • Dynamic resource allocation for network services is pivotal for achieving end-to-end management objectives. Previous research has demonstrated that Reinforcement Learning (RL) is a promising approach to resource allocation in networks, making it possible to obtain near-optimal control policies for non-trivial system configurations. Current RL approaches however have the drawback that a change in the system or the management objective necessitates expensive retraining of the RL agent. To tackle this challenge, practical solutions including offline retraining, transfer learning, and model-based rollout have been proposed. In this work, we study these methods and present comparative results that shed light on their respective performance and benefits. Our study finds that rollout achieves faster adaptation than transfer learning, yet its effectiveness highly depends on the accuracy of the system model.
  •  
25.
  • Uddin, Misbah, et al. (author)
  • A bottom-up design for spatial search in large networks and clouds
  • 2018
  • In: International Journal of Network Management. - : Wiley. - 1055-7148 .- 1099-1190. ; 28:6
  • Journal article (peer-reviewed), abstract:
    • Information in networked systems often has spatial semantics: routers, sensors, or virtual machines have coordinates in a geographical or virtual space, for instance. In this paper, we propose a design for a spatial search system that processes queries against spatial information that is maintained in local databases inside a large networked system. In contrast to previous works in spatial databases and peer-to-peer designs, our design is bottom-up, which makes query routing network-aware and thus efficient, and which facilitates system bootstrapping and adaptation. Key to our design is a protocol that creates and maintains a distributed index of object locations based on information from local databases and the underlying network topology. The index builds upon minimum bounding rectangles to efficiently encode locations. We present a generic search protocol that is based on an echo protocol and uses the index to prune the search space and perform query routing. The response times of search queries increase with the diameter of the network, which is asymptotically optimal. We study the performance of the protocol through simulation in static and dynamic network environments, for different network topologies, and for network sizes up to 100 000 nodes. In most experiments, the overhead incurred by our protocol lies well below 30% of a hypothetical optimal protocol. In addition, the protocol provides high accuracy under significant churn.
  •  
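A minimal sketch of the two index operations that record 25's design rests on: merging minimum bounding rectangles up the tree to summarize where a subtree's objects lie, and intersecting a query region with a subtree's MBR to prune the search. Names and coordinates are illustrative:

from dataclasses import dataclass

@dataclass
class MBR:
    """Minimum bounding rectangle over object coordinates."""
    xmin: float; ymin: float; xmax: float; ymax: float

    def merge(self, other: "MBR") -> "MBR":
        """Index entry a node reports upward: the MBR covering its subtree."""
        return MBR(min(self.xmin, other.xmin), min(self.ymin, other.ymin),
                   max(self.xmax, other.xmax), max(self.ymax, other.ymax))

    def intersects(self, other: "MBR") -> bool:
        """Pruning test: a query region that misses a subtree's MBR
        need not be forwarded into that subtree."""
        return (self.xmin <= other.xmax and other.xmin <= self.xmax and
                self.ymin <= other.ymax and other.ymin <= self.ymax)

subtree = MBR(0, 0, 10, 10).merge(MBR(5, 5, 20, 15))
query = MBR(30, 30, 40, 40)
print(subtree.intersects(query))  # False: the search skips this whole subtree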
26.
  • Villaca, Rodolfo S., et al. (author)
  • Online Learning under Resource Constraints
  • 2021
  • In: 2021 IFIP/IEEE International Symposium on Integrated Network Management (IM 2021). - : IEEE. ; , s. 134-142
  • Conference paper (peer-reviewed), abstract:
    • Data-driven functions for network operation and management are based upon AI/ML methods whose models are usually trained offline with measurement data collected through monitoring. Online learning provides an alternative with the prospect of shorter learning times and lower overhead, suitable for edge or other resource-constrained environments. We propose an approach to online learning that involves a cache of fixed size to store measurement samples and periodic re-computation of ML models. Key to this approach are sample selection algorithms that decide which samples are stored in the cache and which are evicted. We present and evaluate four sample selection algorithms, all of which are derived from well-studied algorithms, and we specifically argue that feature selection algorithms can be used for our purpose. We perform an extensive evaluation of these algorithms for the task of performance prediction using data from an in-house testbed. We find that one of them (RR-SS) leads to models that achieve a prediction accuracy close to that obtained through offline learning, but at a much lower cost.
  •  
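A minimal skeleton of the fixed-size-cache approach from record 26: decide on arrival whether a sample enters the cache (evicting another if full) and retrain the model periodically from the cached samples. Reservoir sampling is used here as a simple, well-known selection rule; it is not the paper's RR-SS algorithm:

import random

class SampleCache:
    """Fixed-size cache of measurement samples with reservoir-sampling
    eviction, so the cache remains a uniform sample of the stream.
    Skeleton only: select-or-evict on arrival, retrain periodically."""
    def __init__(self, capacity, retrain_every=100):
        self.capacity, self.retrain_every = capacity, retrain_every
        self.cache, self.seen = [], 0

    def add(self, sample):
        self.seen += 1
        if len(self.cache) < self.capacity:
            self.cache.append(sample)
        else:
            j = random.randrange(self.seen)   # reservoir sampling (Algorithm R)
            if j < self.capacity:
                self.cache[j] = sample
        if self.seen % self.retrain_every == 0:
            self.retrain()

    def retrain(self):
        pass  # recompute the ML model from self.cache here

cache = SampleCache(capacity=500)
for x in range(10_000):
    cache.add((x, x % 7))                     # (features, target) stand-in
print(len(cache.cache), cache.seen)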
27.
  • Wang, Xiaoxuan, et al. (author)
  • Online Feature Selection for Efficient Learning in Networked Systems
  • 2022
  • In: IEEE Transactions on Network and Service Management. - : Institute of Electrical and Electronics Engineers (IEEE). - 1932-4537. ; 19:3, s. 2885-2898
  • Journal article (peer-reviewed), abstract:
    • Current AI/ML methods for data-driven engineering use models that are mostly trained offline. Such models can be expensive to build in terms of communication and computing costs, and they rely on data that is collected over extended periods of time. Further, they become out-of-date when changes in the system occur. To address these challenges, we investigate online learning techniques that automatically reduce the number of available data sources for model training. We present an online algorithm called Online Stable Feature Set Algorithm (OSFS), which selects a small feature set from a large number of available data sources after receiving a small number of measurements. The algorithm is initialized with a feature ranking algorithm, a feature set stability metric, and a search policy. We perform an extensive experimental evaluation of this algorithm using traces from an in-house testbed and from two external datasets. We find that OSFS achieves a massive reduction in the size of the feature set by 1-3 orders of magnitude on all investigated datasets. Most importantly, we find that the accuracy of a predictor trained on an OSFS-produced feature set is somewhat better than when the predictor is trained on a feature set obtained through offline feature selection. OSFS is thus shown to be effective as an online feature selection algorithm and robust regarding the sample interval used for feature selection. We also find that, when concept drift in the data underlying the model occurs, its effect can be mitigated by recomputing the feature set and retraining the prediction model.
  •  
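A minimal skeleton of the OSFS recipe from record 27: rank features on successive measurement batches, keep the top-k set, and declare it stable once consecutive sets agree closely. Variance ranking, Jaccard similarity, and all constants are illustrative placeholders for the algorithm's configurable ranker, stability metric, and search policy:

import numpy as np

def rank_features(X):
    """Unsupervised ranking by variance (one of many possible rankers)."""
    return np.argsort(-X.var(axis=0))

def osfs_like(batches, k=10, stability=0.8):
    """Return the top-k feature set once consecutive batches agree
    (Jaccard similarity >= threshold). Skeleton of the OSFS idea only."""
    prev = None
    for batch in batches:
        top = set(rank_features(batch)[:k].tolist())
        if prev is not None:
            jaccard = len(top & prev) / len(top | prev)
            if jaccard >= stability:
                return top
        prev = top
    return prev

rng = np.random.default_rng(0)
signal = rng.normal(0, 5, size=(1000, 10))    # 10 informative data sources
noise = rng.normal(0, 1, size=(1000, 990))    # 990 low-variance sources
batches = np.array_split(np.hstack([signal, noise]), 20)
print(sorted(osfs_like(batches)))             # ~ the 10 signal columns

Once the set stabilizes, only the selected sources need to be monitored and collected, which is where the communication and computation savings come from.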
28.
  • Wang, Xiaoxuan, et al. (author)
  • Online Feature Selection for Low-overhead Learning in Networked Systems
  • 2021
  • In: Proceedings of the 2021 17th International Conference on Network and Service Management. - : Institute of Electrical and Electronics Engineers (IEEE). ; , s. 527-529
  • Conference paper (peer-reviewed), abstract:
    • Data-driven functions for operation and management require measurements and readings from distributed data sources for model training and prediction. While the number of candidate data sources can be very large, research has shown that it is often possible to reduce the number of data sources significantly while still allowing for accurate prediction. Consequently, there is potential to lower communication and computing resources needed to continuously extract, collect, and process this data. We demonstrate the operation of a novel online algorithm called OSFS, which sequentially processes the collected data and reduces the number of data sources for training prediction models. OSFS builds on two main ideas, namely (1) ranking the available data sources using (unsupervised) feature selection algorithms and (2) identifying stable feature sets that include only the top features. The demonstration shows the search space exploration, the iterative selection of feature sets, and the evaluation of the stability of these sets. The demonstration uses measurements collected from a KTH testbed, and the predictions relate to end-to-end KPIs for network services. 
  •  
29.
  • Yanggratoke, Rerngvit, 1983- (author)
  • Contributions to Performance Modeling and Management of Data Centers
  • 2013
  • Licentiate thesis (other academic/artistic), abstract:
    • Over the last decade, Internet-based services, such as electronic mail, music-on-demand, and social-network services, have changed the ways we communicate and access information. Usually, the key functionality of such a service is in backend components, which are located in a data center, a facility for hosting computing systems and related equipment. This thesis focuses on two fundamental problems related to the management, dimensioning, and provisioning of such backend components. The first problem centers around resource allocation for a large-scale cloud environment. Data centers have become very large; they often contain hundreds of thousands of machines and applications. In such a data center, resource allocation cannot be efficiently achieved through a traditional management system that is centralized in nature. Therefore, a more scalable solution is needed. To address this problem, we have developed and evaluated a scalable and generic protocol for resource allocation. The protocol is generic in the sense that it can be instantiated for different management objectives through objective functions. The protocol jointly allocates CPU, memory, and network resources to applications that are hosted by the cloud. We prove that the protocol converges to a solution, if an objective function satisfies a certain property. We perform a simulation study of the protocol for realistic scenarios. Simulation results suggest that the quality of the allocation is independent of the system size, up to 100,000 machines and applications, for the management objectives considered. The second problem is related to performance modeling of a distributed key-value store. The specific distributed key-value store we focus on in this thesis is the Spotify storage system. Understanding the performance of the Spotify storage system is essential for achieving a key quality of service objective, namely that the playback latency of a song is sufficiently low. To address this problem, we have developed and evaluated models for predicting the performance of a distributed key-value store for a lightly loaded system. First, we developed a model that allows us to predict the response time distribution of requests. Second, we modeled the capacity of the distributed key-value store for two different object allocation policies. We evaluate the models by comparing model predictions with measurements from two different environments: our lab testbed and a Spotify operational environment. We found that the models are accurate in the sense that the prediction error, i.e., the difference between the model predictions and the measurements from the real systems, is at most 11%.
  •  