SwePub
Search the SwePub database


Result list for search "WFRF:(Carbone Paris)"

Search: WFRF:(Carbone Paris)

  • Result 1-29 of 29
1.
  • Abbas, Zainab (author)
  • Scalable Streaming Graph and Time Series Analysis Using Partitioning and Machine Learning
  • 2021
  • Doctoral thesis (other academic/artistic)abstract
    • Recent years have witnessed a massive increase in the amount of data generated by the Internet of Things (IoT) and social media. Processing huge amounts of this data poses non-trivial challenges in terms of the hardware and performance requirements of modern-day applications. The data we are dealing with today is of massive scale, high intensity and comes in various forms. MapReduce was a popular and clever choice for handling big data using a distributed programming model, which made the processing of huge volumes of data possible using clusters of commodity machines. However, MapReduce was not a good fit for performing complex tasks, such as graph processing, iterative programs and machine learning. Modern data processing frameworks, which are popularly used to process complex data and perform complex analysis tasks, overcome the shortcomings of MapReduce. Some of these popular frameworks include Apache Spark for batch and stream processing, Apache Flink for stream processing and TensorFlow for machine learning. In this thesis, we deal with complex analytics on data modeled as time series, graphs and streams. Time series are commonly used to represent temporal data generated by IoT sensors. Analysing and forecasting time series, i.e. extracting useful characteristics and statistics of data and predicting data, is useful for many fields, including neuro-physiology, economics, environmental studies, transportation, etc. Another useful data representation we work with is graphs. Graphs are complex data structures used to represent relational data in the form of vertices and edges. Graphs are present in various application domains, such as recommendation systems, road traffic analytics, web analysis and social media analysis. Due to the increasing size of graph data, a single machine is often not sufficient to process the complete graph. Therefore, the computation, as well as the data, must be distributed. Graph partitioning, the process of dividing graphs into subgraphs, is an essential step in distributed processing of large-scale graphs because it enables parallel and distributed processing. The majority of data generated from IoT and social media originates as a continuous stream, such as series of events from a social media network, time series generated from sensors, financial transactions, etc. The stream processing paradigm refers to the processing of data streams that are continuous and possibly unbounded. Combining both graphs and streams leads to an interesting and rather challenging domain of streaming graph analytics. Graph streams refer to data that is modelled as a stream of edges or vertices with adjacency lists representing relations between entities of continuously evolving data generated by a single or multiple data sources. Streaming graph analytics is an emerging research field with great potential due to its capabilities of processing large graph streams with limited amounts of memory and low latency. In this dissertation, we present graph partitioning techniques for scalable streaming graph and time series analysis. First, we present and evaluate the use of data partitioning to enable data parallelism in order to address the challenge of scale in large spatial time series forecasting. We propose a graph partitioning technique for large-scale spatial time series forecasting, with road traffic as a use-case. 
Our experimental results on traffic density prediction for a real-world sensor dataset using Long Short-Term Memory Neural Networks show that the partitioning-based models have 12x lower training time when run in parallel compared to the unpartitioned model of the entire road infrastructure. Furthermore, the partitioning-based models have 2x lower prediction error (RMSE) compared to the entire-road model. Second, we showcase the practical usefulness of streaming graph analytics for large spatial time series analysis with the real-world task of traffic jam detection and reduction. We propose to apply streaming graph analytics by performing useful analytics on traffic data streams at scale with high throughput and low latency. Third, we study, evaluate, and compare the existing state-of-the-art streaming graph partitioning algorithms. We propose a uniform analysis framework built using Apache Flink to evaluate and compare partitioning features and characteristics of streaming graph partitioning methods. Finally, we present GCNSplit, a novel ML-driven streaming graph partitioning solution that uses a small and constant in-memory state (bounded state) to partition (possibly unbounded) graph streams. Our results demonstrate that GCNSplit provides high-throughput partitioning and can leverage data parallelism to sustain input rates of 100K edges/s. GCNSplit exhibits a partitioning quality, in terms of graph cuts and load balance, that matches that of the state-of-the-art HDRF (High Degree Replicated First) algorithm while storing three orders of magnitude smaller partitioning state.
  •  
2.
  • Abbas, Zainab, 1991-, et al. (author)
  • Streaming Graph Partitioning: An Experimental Study
  • 2018
  • In: Proceedings of the VLDB Endowment. - : ACM Digital Library. - 2150-8097. ; 11:11, s. 1590-1603
  • Journal article (peer-reviewed)abstract
    • Graph partitioning is an essential yet challenging task for massive graph analysis in distributed computing. Common graph partitioning methods scan the complete graph to obtain structural characteristics offline, before partitioning. However, the emerging need for low-latency, continuous graph analysis led to the development of online partitioning methods. Online methods ingest edges or vertices as a stream, making partitioning decisions on the fly based on partial knowledge of the graph. Prior studies have compared offline graph partitioning techniques across different systems, yet little effort has been put into investigating the characteristics of online graph partitioning strategies. In this work, we describe and categorize online graph partitioning techniques based on their assumptions, objectives and costs. Furthermore, we conduct an experimental comparison across different applications and datasets, using a unified distributed runtime based on Apache Flink. Our experimental results show that model-dependent online partitioning techniques such as low-cut algorithms offer better performance for communication-intensive applications such as bulk synchronous iterative algorithms, albeit at higher partitioning costs. Otherwise, model-agnostic techniques trade off data locality for lower partitioning costs and balanced workloads, which is beneficial when executing data-parallel single-pass graph algorithms. (A minimal streaming-partitioner sketch follows this record.)
  •  
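The record above contrasts offline partitioners with online ones that assign each edge on arrival using only partial knowledge of the graph. The Scala sketch below is a minimal illustration of that idea: a greedy streaming edge partitioner that prefers partitions already hosting an endpoint and breaks ties by load. All names (`StreamingEdgePartitioner`, `partition`) are invented for illustration; this is not any of the algorithms evaluated in the paper.

```scala
// Illustrative greedy streaming edge partitioner (not the paper's exact algorithms).
// Each edge is assigned on arrival using only partial knowledge: the partitions
// its endpoints have already been seen in, plus the current partition loads.
object StreamingEdgePartitioner {
  final case class Edge(src: Long, dst: Long)

  def partition(edges: Iterator[Edge], k: Int): Map[Edge, Int] = {
    val loads    = Array.fill(k)(0L)                                       // edges per partition
    val replicas = scala.collection.mutable.Map.empty[Long, Set[Int]]      // vertex -> partitions seen

    edges.map { e =>
      val srcParts = replicas.getOrElse(e.src, Set.empty)
      val dstParts = replicas.getOrElse(e.dst, Set.empty)
      // Prefer partitions that already host an endpoint (fewer cut vertices),
      // breaking ties by the least-loaded partition.
      val candidates =
        if ((srcParts intersect dstParts).nonEmpty) srcParts intersect dstParts
        else if ((srcParts union dstParts).nonEmpty) srcParts union dstParts
        else (0 until k).toSet
      val target = candidates.minBy(loads(_))
      loads(target) += 1
      replicas(e.src) = srcParts + target
      replicas(e.dst) = dstParts + target
      e -> target
    }.toMap
  }
}
```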
3.
  •  
4.
  • Carbone, Paris, et al. (author)
  • Auto-Scoring of Personalised News in the Real-Time Web : Challenges, Overview and Evaluation of the State-of-the-Art Solutions
  • 2015
  • Conference paper (peer-reviewed)abstract
    • The problem of automated personalised news recommendation, often referred to as auto-scoring, has attracted substantial research throughout the last decade in multiple domains such as data mining and machine learning, computer systems, e-commerce and sociology. A typical "recommender systems" approach to solving this problem usually adopts content-based scoring, collaborative filtering or, more often, a hybrid approach. Due to their special nature, news articles introduce further challenges and constraints to conventional item recommendation problems, being characterised by short lifetimes and rapid popularity trends. In this survey, we provide an overview of the challenges and current solutions in news personalisation and ranking from both an algorithmic and a system design perspective, and present our evaluation of the most representative scoring algorithms while also exploring the benefits of using a hybrid approach. Our evaluation is based on a real-life case study in news recommendations. (A toy hybrid-scoring sketch follows this record.)
  •  
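The survey above covers content-based, collaborative and hybrid scoring, with news-specific constraints such as short item lifetimes. The sketch below shows, under assumed weights and helper names of my own (`HybridNewsScorer`, `alpha`, `halfLifeHours`), how a toy hybrid score with recency decay could be composed; it is not any scorer evaluated in the survey.

```scala
// Toy hybrid auto-scoring sketch: blends a content-based score with a
// collaborative-filtering score and decays it by article age, reflecting the
// short lifetime of news items. Weights and inputs are purely illustrative.
object HybridNewsScorer {
  final case class Article(id: String, topicVector: Vector[Double], ageHours: Double)

  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0) 0.0 else dot / norm
  }

  def score(userProfile: Vector[Double],
            collaborativeScore: Double,   // e.g. from a matrix-factorisation model
            article: Article,
            alpha: Double = 0.6,          // content vs collaborative weight
            halfLifeHours: Double = 6.0): Double = {
    val content = cosine(userProfile, article.topicVector)
    val decay   = math.pow(0.5, article.ageHours / halfLifeHours)
    (alpha * content + (1 - alpha) * collaborativeScore) * decay
  }
}
```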
5.
  • Carbone, Paris, et al. (author)
  • Beyond Analytics : The Evolution of Stream Processing Systems
  • 2020
  • In: Proceedings of the ACM SIGMOD International Conference on Management of Data. - New York, NY, USA : Association for Computing Machinery. - 9781450367356 ; , s. 2651-2658
  • Conference paper (peer-reviewed)abstract
    • Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. The goal of this tutorial is threefold. First, we aim to review and highlight noteworthy past research findings, which were largely ignored until very recently. Second, we intend to underline the differences between early ('00-'10) and modern ('11-'18) streaming systems, and how those systems have evolved through the years. Most importantly, we wish to turn the attention of the database community to recent trends: streaming systems are no longer used only for classic stream processing workloads, namely window aggregates and joins. Instead, modern streaming systems are being increasingly used to deploy general event-driven applications in a scalable fashion, challenging the design decisions, architecture and intended use of existing stream processing systems. 
  •  
6.
  • Carbone, Paris, et al. (author)
  • Cutty : Aggregate Sharing for User-Defined Windows
  • 2016
  • In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450340731 ; , s. 1201-1210
  • Conference paper (peer-reviewed)abstract
    • Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through aggregate sharing techniques such as Panes and Pairs, little to no work has been put into optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions, which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all. In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders-of-magnitude reductions in aggregation costs compared to the state of the art. (A minimal session-window sketch follows this record.)
  •  
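Cutty is built around windows declared programmatically as user-defined functions. As a rough illustration of that abstraction only, the sketch below folds a stream into gap-based session windows with incremental aggregation; Cutty's slice-based aggregate sharing across queries is deliberately not reproduced, and all identifiers are invented.

```scala
// Minimal sketch of a programmatically defined (user-defined) session window
// with incremental aggregation. It illustrates the UDW idea only; Cutty's
// slice-based aggregate sharing across queries is not reproduced here.
object SessionWindowSketch {
  final case class Event(timestamp: Long, value: Double)

  /** Folds events into per-session sums; a session closes when the gap between
    * consecutive events exceeds `gap` (the user-defined window boundary). */
  def sessionSums(events: Seq[Event], gap: Long): Seq[Double] = {
    val (sums, open) =
      events.sortBy(_.timestamp).foldLeft((Vector.empty[Double], Option.empty[(Long, Double)])) {
        case ((closed, None), e) =>
          (closed, Some((e.timestamp, e.value)))                    // first event opens a session
        case ((closed, Some((last, acc))), e) if e.timestamp - last > gap =>
          (closed :+ acc, Some((e.timestamp, e.value)))             // boundary reached: emit aggregate
        case ((closed, Some((_, acc))), e) =>
          (closed, Some((e.timestamp, acc + e.value)))              // extend session incrementally
      }
    sums ++ open.map(_._2).toList                                   // flush the still-open session
  }
}
```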
7.
  • Carbone, Paris, et al. (author)
  • Large-scale data stream processing systems
  • 2017
  • In: Handbook of Big Data Technologies. - Cham : Springer International Publishing. - 9783319493404 - 9783319493398 ; , s. 219-260
  • Book chapter (other academic/artistic)abstract
    • In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems perceive data as infinite streams and are designed to satisfy such requirements. They have further evolved substantially both in terms of expressive programming model support and also efficient and durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and applied computations, while hiding details on how these data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models further allowing for scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address the main challenges of disruptive applications that large-scale data streaming enables from a systemic point of view.
  •  
8.
  • Carbone, Paris, 1986-, et al. (author)
  • Lightweight Asynchronous Snapshots for Distributed Dataflows
  • 2015
  • Reports (other academic/artistic)abstract
    • Distributed stateful stream processing enables the deployment and execution of large-scale continuous computations in the cloud, targeting both low latency and high throughput. One of the most fundamental challenges of this paradigm is providing processing guarantees under potential failures. Existing approaches rely on periodic global state snapshots that can be used for failure recovery. Those approaches suffer from two main drawbacks. First, they often stall the overall computation, which impacts ingestion. Second, they eagerly persist all records in transit along with the operator states, which results in larger snapshots than required. In this work we propose Asynchronous Barrier Snapshotting (ABS), a lightweight algorithm suited for modern dataflow execution engines that minimises space requirements. ABS persists only operator states on acyclic execution topologies while keeping a minimal record log on cyclic dataflows. We implemented ABS on Apache Flink, a distributed analytics engine that supports stateful stream processing. Our evaluation shows that our algorithm does not have a heavy impact on the execution, maintaining linear scalability and performing well with frequent snapshots. (A simplified barrier-alignment sketch follows this record.)
  •  
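To make the barrier idea concrete, here is a heavily simplified, single-operator sketch of barrier alignment: records are processed normally, and once a snapshot barrier has arrived on every input channel, only the operator state is copied. The full ABS protocol (channel buffering, barrier forwarding, cyclic dataflows) is omitted, and the code is not taken from Flink.

```scala
// Simplified sketch of barrier-style asynchronous snapshotting for a single
// operator with several input channels. The general ABS protocol also handles
// record buffering, barrier forwarding and cyclic dataflows, all omitted here.
object BarrierSnapshotSketch {
  sealed trait Message
  final case class Record(value: Long)       extends Message
  final case class Barrier(snapshotId: Long) extends Message

  final class Operator(numInputs: Int) {
    private var state: Long = 0L                                 // operator state (a running sum)
    private val blocked   = scala.collection.mutable.Set.empty[Int]
    private var snapshots = Map.empty[Long, Long]                // snapshotId -> copied state

    def onMessage(channel: Int, msg: Message): Unit = msg match {
      case Record(v) if !blocked(channel) => state += v          // normal processing
      case Record(_)                      => ()                  // real ABS buffers these records
      case Barrier(id) =>
        blocked += channel                                       // stop consuming this channel
        if (blocked.size == numInputs) {                         // barriers aligned on all inputs
          snapshots += id -> state                               // persist only operator state
          blocked.clear()                                        // resume; barrier goes downstream
        }
    }

    def snapshotOf(id: Long): Option[Long] = snapshots.get(id)
  }
}
```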
9.
  • Carbone, Paris, 1986- (author)
  • Scalable and Reliable Data Stream Processing
  • 2018
  • Doctoral thesis (other academic/artistic)abstract
    • Data-stream management systems have for long been considered as a promising architecture for fast data management. The stream processing paradigm poses an attractive means of declaring persistent application logic coupled with state over evolving data. However, despite contributions in programming semantics addressing certain aspects of data streaming, existing approaches have been lacking a clear, universal specification for the underlying system execution. We investigate the case of data stream processing as a general-purpose scalable computing architecture that can support continuous and iterative state-driven workloads. Furthermore, we examine how this architecture can enable the composition of reliable, reconfigurable services and complex applications that go even beyond the needs of scalable data analytics, a major trend in the past decade. In this dissertation, we specify a set of core components and mechanisms to compose reliable data stream processing systems while adopting three crucial design principles: blocking-coordination avoidance, programming-model transparency, and compositionality. Furthermore, we identify the core open challenges among the academic and industrial state of the art and provide a complete solution using these design principles as a guide. Our contributions address the following problems: I) Reliable Execution and Stream State Management, II) Computation Sharing and Semantics for Stream Windows, and III) Iterative Data Streaming. Several parts of this work have been integrated into Apache Flink, a widely-used, open-source scalable computing framework, and supported the deployment of hundreds of long-running large-scale production pipelines worldwide.
  •  
10.
  • Carbone, Paris, et al. (author)
  • State Management in Apache Flink : Consistent Stateful Distributed Stream Processing
  • 2017
  • In: Proceedings of the VLDB Endowment. - : ACM Digital Library. - 2150-8097. ; 10, s. 1718-1729
  • Journal article (peer-reviewed)abstract
    • Stream processors are emerging in industry as an apparatus that drives analytical but also mission-critical services handling the core of persistent application logic. Thus, apart from scalability and low latency, a rising system need is first-class support for application state together with strong consistency guarantees, and adaptivity to cluster reconfigurations, software patches and partial failures. Although prior systems research has addressed some of these specific problems, the practical challenge lies in how such guarantees can be materialized in a transparent, non-intrusive manner that relieves the user from unnecessary constraints. Such needs served as the main design principles of state management in Apache Flink, an open source, scalable stream processor. We present Flink’s core pipelined, in-flight mechanism which guarantees the creation of lightweight, consistent, distributed snapshots of application state, progressively, without impacting continuous execution. Consistent snapshots cover all needs for system reconfiguration, fault tolerance and version management through coarse-grained rollback recovery. Application state is declared explicitly to the system, allowing efficient partitioning and transparent commits to persistent storage. We further present Flink’s backend implementations and mechanisms for high availability, external state queries and output commit. Finally, we demonstrate how these mechanisms behave in practice with metrics and large-deployment insights exhibiting the low performance trade-offs of our approach and the general benefits of exploiting asynchrony in continuous, yet sustainable system deployments. (A conceptual declared-state sketch follows this record.)
  •  
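The abstract emphasises that application state is declared explicitly to the system so that it can be snapshotted, committed and rolled back wholesale. The sketch below captures only that concept with an invented `StateBackend`; it does not use or resemble Flink's actual state API.

```scala
// Sketch of the "declared state" idea: application state is registered with a
// runtime, which can then snapshot and restore it wholesale for rollback
// recovery or reconfiguration. This mirrors the concept only, not Flink's API.
object DeclaredStateSketch {
  final class StateBackend {
    private var stores    = Map.empty[String, Map[String, Long]]            // state name -> keyed state
    private var snapshots = Map.empty[Long, Map[String, Map[String, Long]]] // snapshot id -> full copy

    def update(name: String, key: String, f: Long => Long): Unit = {
      val store = stores.getOrElse(name, Map.empty)
      stores += name -> (store + (key -> f(store.getOrElse(key, 0L))))
    }
    def value(name: String, key: String): Long =
      stores.getOrElse(name, Map.empty).getOrElse(key, 0L)

    def snapshot(id: Long): Unit = snapshots += id -> stores      // immutable maps: cheap to copy
    def restore(id: Long): Unit  = stores = snapshots(id)         // coarse-grained rollback
  }

  def main(args: Array[String]): Unit = {
    val backend = new StateBackend
    backend.update("counts", "sensor-1", _ + 5)
    backend.snapshot(1L)
    backend.update("counts", "sensor-1", _ + 7)
    backend.restore(1L)                                           // roll back to snapshot 1
    println(backend.value("counts", "sensor-1"))                  // prints 5
  }
}
```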
11.
  • Carbone, Paris, et al. (author)
  • Towards highly available complex event processing deployments in the cloud
  • 2013
  • In: International Conference on Next Generation Mobile Applications, Services, and Technologies. - : IEEE. - 9781479920198 ; , s. 153-158
  • Conference paper (peer-reviewed)abstract
    • Recent advances in distributed computing have made it possible to achieve high availability on traditional systems and thus serve them as reliable services. For several offline computational applications, such as fine-grained batch processing, their parallel nature in addition to weak consistency requirements allowed a more straightforward transition. On the other hand, online processing systems such as Complex Event Processing (CEP) still maintain a monolithic architecture, being able to offer high expressiveness and vertical scalability at the expense of low distribution. Despite attempts to design dedicated distributed CEP systems, there is potential for existing systems to benefit from a sustainable cloud deployment. In this work we address the main challenges of providing such a CEP service with a focus on reliability, since it is the most crucial aspect of that transition. Our approach targets low average detection latency and sustainability by leveraging event delegation mechanisms present on existing stream execution platforms. It also introduces redundancy and transactional logging to provide improved fault tolerance and partial recovery. Our performance analysis illustrates the benefits of our approach and shows acceptable performance costs for online CEP exhibited by the fault tolerance mechanisms we introduced.
  •  
12.
  • Fragkoulis, Marios, et al. (author)
  • A survey on the evolution of stream processing systems
  • 2024
  • In: The VLDB journal. - : Springer Science and Business Media Deutschland GmbH. - 1066-8888 .- 0949-877X. ; 33:2, s. 507-541
  • Journal article (peer-reviewed)abstract
    • Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems. 
  •  
13.
  • Hasselberg, Adam, et al. (author)
  • Cliffhanger : An Experimental Evaluation of Stateful Serverless at the Edge
  • 2024
  • In: 2024 19th Wireless On-Demand Network Systems and Services Conference. - : IEEE. ; , s. 41-48
  • Conference paper (peer-reviewed)abstract
    • The serverless computing paradigm has transformed cloud service deployment by enabling automatic scaling of resources in response to varying demand. Building on this, stateful serverless computing introduces critical capabilities for data management, fault tolerance, and consistency, which are particularly relevant in the context of distributed deployments, notably in edge computing environments. In this work, we explore the feasibility of stateful serverless computing in resource-limited edge environments through an empirical study utilizing a multi-view object tracking application. Our results show that while these systems perform well in cloud environments, their effectiveness is severely affected at the edge due to state, application, and resource management solutions optimized for cloud environments. Existing solutions are most detrimental to applications with intermittent workloads, as typical combinations of concurrency handling and resource reservation can lead to minutes of unstable system behavior due to cold starts. Our results highlight the need for a tailored approach in stateful serverless systems for edge computing scenarios.
  •  
14.
  • Horchidan, Sonia-Florina, et al. (author)
  • Crayfish: Navigating the Labyrinth of Machine Learning Inference in Stream Processing Systems
  • 2024
  • In: Advances in Database Technology - EDBT. - : Open Proceedings.org. ; , s. 676-689
  • Conference paper (peer-reviewed)abstract
    • As Machine Learning predictions are increasingly being used in business analytics pipelines, integrating stream processing with model serving has become a common data engineering task. Despite their synergies, separate software stacks typically handle streaming analytics and model serving. Systems for data stream management do not support ML inference out-of-the-box, while model-serving frameworks have limited functionality for continuous data transformations, windowing, and other streaming tasks. As a result, developers are left with a design space dilemma whose trade-offs are not well understood. This paper presents Crayfish, an extensible benchmarking framework that facilitates designing and executing comprehensive evaluation studies of streaming inference pipelines. We demonstrate the capabilities of Crayfish by studying four data processing systems, three embedded libraries, three external serving frameworks, and two pre-trained models. Our results prove the necessity of a standardized benchmarking framework and show that (1) even for serving tools in the same category, the performance can vary greatly and, sometimes, defy intuition, (2) GPU accelerators can show compelling improvements for the serving task, but the improvement varies across tools, and (3) serving alternatives can achieve significantly different performance, depending on the stream processors they are integrated with.
  •  
15.
  • Horchidan, Sonia-Florina, et al. (author)
  • Evaluating model serving strategies over streaming data
  • 2022
  • In: Proceedings of the 6th Workshop on Data Management for End-To-End Machine Learning, DEEM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference. - New York, NY, USA : Association for Computing Machinery (ACM).
  • Conference paper (peer-reviewed)abstract
    • We present the first performance evaluation study of model serving integration tools in stream processing frameworks. Using Apache Flink as a representative stream processing system, we evaluate alternative Deep Learning serving pipelines for image classification. Our performance evaluation considers both the case of embedded use of Machine Learning libraries within stream tasks and that of external serving via Remote Procedure Calls. The results indicate superior throughput and scalability for pipelines that make use of embedded libraries to serve pre-trained models. Latency, however, can vary across strategies, with external serving even achieving lower latency when network conditions are optimal, due to better specialized use of the underlying hardware. We discuss our findings and provide further motivating arguments towards research in the area of ML-native data streaming engines in the future. (A sketch contrasting the two strategies follows this record.)
  •  
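A minimal way to picture the two strategies compared above: an embedded model call runs inside the stream task, whereas an external call pays serialization and a network round trip. The sketch uses a stand-in linear "model" and a `Thread.sleep` as a placeholder for RPC latency; no real serving framework is involved, and all names are invented.

```scala
// Sketch contrasting the two serving strategies the study compares: an embedded
// (in-process) model call versus an external call over RPC. The "model" and the
// network delay are stand-ins; no real serving framework is used here.
object ServingStrategiesSketch {
  trait ModelServer { def classify(image: Array[Float]): Int }

  // Embedded: the model runs inside the stream task, no network hop.
  final class EmbeddedServer(weights: Array[Float]) extends ModelServer {
    def classify(image: Array[Float]): Int =
      if (image.zip(weights).map { case (x, w) => x * w }.sum > 0f) 1 else 0
  }

  // External: each call pays a round trip to a separate serving process.
  final class ExternalServer(delegate: ModelServer, roundTripMillis: Long) extends ModelServer {
    def classify(image: Array[Float]): Int = {
      Thread.sleep(roundTripMillis)             // stand-in for serialization + RPC round trip
      delegate.classify(image)
    }
  }

  def serveStream(images: Iterator[Array[Float]], server: ModelServer): Iterator[Int] =
    images.map(server.classify)                 // the map operator of a streaming pipeline
}
```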
16.
  • Horchidan, Sonia, et al. (author)
  • ORB : Empowering Graph Queries through Inference
  • 2023
  • In: ESWC-JP 2023. - : CEUR-WS.
  • Conference paper (peer-reviewed)abstract
    • Executing queries on incomplete, sparse knowledge graphs yields incomplete results, especially when it comes to queries involving traversals. In this paper, we question the applicability of all known architectures for incomplete knowledge bases and propose ORB: a clear departure from existing system designs, relying on Machine Learning-based operators to provide inferred query results. At the same time, ORB addresses peculiarities inherent to knowledge graphs, such as schema evolution, dynamism, scalability, as well as high query complexity via the use of embedding-driven inference. Through ORB, we stress that approximating complex processing tasks is not only desirable but also imperative for knowledge graphs.
  •  
17.
  • Kroll, Lars, 1989-, et al. (author)
  • Arc : An IR for batch and stream programming
  • 2019
  • In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). - New York, NY, USA : Association for Computing Machinery (ACM). - 9781450367189 ; , s. 53-58
  • Conference paper (peer-reviewed)abstract
    • In big data analytics, there is currently a large number of data programming models and their respective frontends, such as relational tables, graphs, tensors, and streams. This has led to a plethora of runtimes that typically focus on the efficient execution of just a single frontend. This fragmentation manifests itself today in highly complex pipelines that bundle multiple runtimes to support the necessary models. Hence, joint optimization and execution of such pipelines across these frontend-bound runtimes is infeasible. We propose Arc as the first unified Intermediate Representation (IR) for data analytics that incorporates stream semantics based on a modern specification of streams, windows and stream aggregation, to combine batch and stream computation models. Arc extends Weld, an IR for batch computation, and adds support for partitioned, out-of-order stream and window operators, which are the most fundamental building blocks in contemporary data streaming. (A toy unified-IR sketch follows this record.)
  •  
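As a toy illustration of what a unified IR over batch and stream operators can look like, the sketch below defines a tiny expression tree with map, filter, window and aggregate nodes and one trivial fusion pass. Arc's actual node set, type system and semantics are far richer; every name here is invented.

```scala
// Toy intermediate representation combining batch-style and stream-style
// operators in one expression tree, to convey the unified-IR idea; Arc's
// actual node set, types and semantics are not reproduced here.
object TinyStreamIR {
  sealed trait Expr
  final case class Source(name: String)                                 extends Expr
  final case class MapOp(input: Expr, fn: String)                       extends Expr
  final case class FilterOp(input: Expr, predicate: String)             extends Expr
  final case class WindowOp(input: Expr, lengthMs: Long, slideMs: Long) extends Expr
  final case class Aggregate(input: Expr, agg: String)                  extends Expr

  /** A trivial "optimisation" pass: fuse adjacent maps into one operator. */
  def fuseMaps(e: Expr): Expr = e match {
    case MapOp(MapOp(inner, f), g) => fuseMaps(MapOp(inner, s"$f andThen $g"))
    case MapOp(i, f)               => MapOp(fuseMaps(i), f)
    case FilterOp(i, p)            => FilterOp(fuseMaps(i), p)
    case WindowOp(i, l, s)         => WindowOp(fuseMaps(i), l, s)
    case Aggregate(i, a)           => Aggregate(fuseMaps(i), a)
    case s: Source                 => s
  }
}
```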
18.
  • Kroll, Lars, 1989-, et al. (author)
  • Kompics Scala : Narrowing the gap between algorithmic specification and executable code (short paper)
  • 2017
  • In: Proceedings of the 8th ACM SIGPLAN International Symposium on Scala. - New York, NY, USA : ACM Digital Library. - 9781450355292 ; , s. 73-77
  • Conference paper (peer-reviewed)abstract
    • Message-based programming frameworks facilitate the development and execution of core distributed computing algorithms today. Their twofold aim is to expose a programming model that minimises logical errors incurred during translation from an algorithmic specification to an executable program, and also to provide an efficient runtime for event pattern-matching and scheduling of distributed components. Kompics Scala is a framework that allows for a direct, streamlined translation from a formal algorithm specification to practical code by reducing the cognitive gap between the two representations. Furthermore, its runtime decouples event pattern-matching and component execution logic, yielding clean, thoroughly expected behaviours. Our evaluation shows low and constant performance overhead of Kompics Scala compared to similar frameworks that otherwise fail to offer the same level of model clarity. (A generic event-handler sketch follows this record.)
  •  
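To convey the style of programming the paper targets, the sketch below shows a generic component that registers pattern-matching handlers over events, staying close to "upon receive" pseudocode. This is an invented mini-framework, not the Kompics Scala API.

```scala
// Generic sketch of translating "upon receive" pseudocode into pattern-matching
// event handlers on a component. Not the Kompics Scala API; all names invented.
object ComponentSketch {
  sealed trait Event
  final case class Request(id: Long)                extends Event
  final case class Response(id: Long, ok: Boolean)  extends Event
  case object Timeout                               extends Event

  abstract class Component {
    private var handlers = List.empty[PartialFunction[Event, Unit]]
    protected def uponEvent(h: PartialFunction[Event, Unit]): Unit = handlers ::= h
    def trigger(e: Event): Unit = handlers.find(_.isDefinedAt(e)).foreach(_(e))
  }

  final class PingPonger extends Component {
    private var pending = Set.empty[Long]
    uponEvent { case Request(id)        => pending += id }        // "upon receive <Request | id>"
    uponEvent { case Response(id, true) => pending -= id }        // "upon receive <Response | id, ok>"
    uponEvent { case Timeout            => pending = Set.empty }  // "upon <Timeout>"
  }
}
```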
19.
  • Lindén, Joakim, et al. (author)
  • Autonomous Realization of Safety- and Time-Critical Embedded Artificial Intelligence
  • 2024
  • In: 2024 Design, Automation and Test in Europe Conference and Exhibition, DATE 2024 - Proceedings. - : Institute of Electrical and Electronics Engineers (IEEE).
  • Conference paper (peer-reviewed)abstract
    • There is an evident need to complement embedded critical control logic with AI inference, but today's AI-capable hardware, software, and processes are primarily targeted towards the needs of cloud-centric actors. Telecom and defense airspace industries, which make heavy use of specialized hardware, face the challenge of manually hand-tuning AI workloads and hardware, presenting an unprecedented cost and complexity due to the diversity and sheer number of deployed instances. Furthermore, embedded AI functionality must not adversely affect real-time and safety requirements of the critical business logic. To address this, end-to-end AI pipelines for critical platforms are needed to automate the adaption of networks to fit into resource-constrained devices under critical and real-time constraints, while remaining interoperable with de-facto standard AI tools and frameworks used in the cloud. We present two industrial applications where such solutions are needed to bring AI to critical and resource-constrained hardware, and a generalized end-to-end AI pipeline that addresses these needs. Crucial steps to realize it are taken in the industry-academia collaborative FASTER-AI project.
  •  
20.
  • Meldrum, M., et al. (author)
  • Arcon : Continuous and deep data stream analytics
  • 2019
  • In: ACM International Conference Proceeding Series. - New York, NY, USA : Association for Computing Machinery. - 9781450376600
  • Conference paper (peer-reviewed)abstract
    • Contemporary end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types, there exist several frontends (e.g., SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, Tensorflow) that optimize for a particular frontend and possibly a hardware architecture (e.g., GPUs). The resulting pipelines suffer in terms of complexity and performance due to excessive type conversions, materialization of intermediate results, and lack of cross-framework optimizations. Arcon aims to provide a unified approach to declare and execute tasks across frontend boundaries as well as enabling their seamless integration with event-driven services at scale. In this demonstration, we present Arcon and, through a series of use-case scenarios, demonstrate that its execution model is powerful enough to cover existing as well as upcoming real-time computations for analytics and application-specific needs.
  •  
21.
  • Ng, Harald, et al. (author)
  • Omni-Paxos : Breaking the Barriers of Partial Connectivity
  • 2023
  • In: Proceedings of the 18th European Conference on Computer Systems, EuroSys 2023. - : Association for Computing Machinery, Inc. - 9781450394871 ; , s. 314-330
  • Conference paper (peer-reviewed)abstract
    • Omni-Paxos is a system for state machine replication that is completely resilient to partial network partitions, a major source of service disruptions in recent years. Omni-Paxos achieves its resilience through a decoupled design that separates the execution and state of leader election from log replication. The leader election builds on the concept of quorum-connected servers, with the sole focus on connectivity. Additionally, by decoupling reconfiguration from log replication, Omni-Paxos provides flexible and parallel log migration that improves the performance and robustness of reconfiguration. Our evaluation showcases two benefits over state-of-the-art protocols: (1) guaranteed recovery in at most four election timeouts under extreme partial network partitions, and (2) up to 8x shorter reconfiguration periods with 46% less I/O at the leader. (A quorum-connectivity sketch follows this record.)
  •  
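A minimal sketch of the quorum-connected notion mentioned above, assuming connectivity is given as an adjacency map: a server is an eligible leader candidate only if it can currently reach a majority of the cluster. Ballots, heartbeats and log replication, i.e. the actual protocol, are omitted, and the function names are invented.

```scala
// Sketch of the quorum-connected idea behind the leader election: a server is
// eligible to lead only if it can reach a majority of the servers. Connectivity
// is just an adjacency map here; the full election protocol is omitted.
object QuorumConnectedSketch {
  type ServerId = Int

  def quorumConnected(self: ServerId, reachable: Set[ServerId], clusterSize: Int): Boolean =
    (reachable + self).size > clusterSize / 2          // majority, counting itself

  /** Among the servers whose connectivity we know, pick an eligible leader candidate. */
  def candidate(connectivity: Map[ServerId, Set[ServerId]], clusterSize: Int): Option[ServerId] =
    connectivity.collect {
      case (s, reach) if quorumConnected(s, reach, clusterSize) => s
    }.minOption                                        // deterministic tie-break by id
}
```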
22.
  • Ng, Harald, et al. (author)
  • UniCache : Efficient Log Replication through Learning Workload Patterns
  • 2023
  • In: Advances in Database Technology - EDBT. - : OpenProceedings.org. ; , s. 471-477
  • Conference paper (peer-reviewed)abstract
    • Most of the world's cloud data service workloads are currently backed by replicated state machines. Production-grade log replication protocols used for the job impose heavy data transfer duties on the primary server, which needs to disseminate the log commands to all the replica servers. UniCache proposes a principled solution to this problem using a learned replicated cache, which enables commands to be sent over the network as compressed encodings. UniCache takes advantage of the fact that each replica has access to a consistent prefix of the replicated log, which allows them to build a uniform lookup cache used for compressing and decompressing commands consistently. UniCache achieves effective speedups, lowering the primary load in application workloads with a skewed data distribution. Our experimental studies showcase a low pre-processing overhead and the highest performance gains in cross-data center deployments over wide area networks. (A lookup-cache encoding sketch follows this record.)
  •  
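The key observation in the record above is that replicas share a consistent log prefix, so they can maintain identical lookup tables and ship small indices instead of full command payloads. The sketch below illustrates that with a naive most-recently-committed cache; the learned cache policy of UniCache is not modelled, and all identifiers are invented.

```scala
// Sketch of the cache-based encoding idea: because every replica sees the same
// committed values in the same order, they can build identical lookup tables and
// the leader can ship small indices instead of full values. The admission policy
// here is a naive "most recently committed"; the learned policy is not modelled.
object LookupCacheSketch {
  sealed trait Wire
  final case class Full(value: String) extends Wire    // value sent verbatim
  final case class Cached(index: Int)  extends Wire    // value replaced by a cache slot

  final class ReplicatedCache(capacity: Int) {
    private var slots = Vector.empty[String]           // identical on every replica

    def encode(value: String): Wire = slots.indexOf(value) match {
      case -1 => admit(value); Full(value)             // miss: send verbatim, both sides admit
      case i  => Cached(i)                             // hit: send only the index
    }
    def decode(w: Wire): String = w match {
      case Full(v)   => admit(v); v                    // mirror the sender's admission
      case Cached(i) => slots(i)
    }
    private def admit(v: String): Unit =
      slots = (v +: slots).take(capacity)              // same deterministic update everywhere
  }
}
```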
23.
  • Perini, M., et al. (author)
  • Learning on streaming graphs with experience replay
  • 2022
  • In: Proceedings of the ACM Symposium on Applied Computing. - New York, NY, USA : Association for Computing Machinery. - 9781450387132 ; , s. 470-478
  • Conference paper (peer-reviewed)abstract
    • Graph Neural Networks (GNNs) have recently achieved good performance in many predictive tasks involving graph-structured data. However, the majority of existing models consider static graphs only and do not support training on graph streams. While inductive representation learning can generate predictions for unseen vertices, these are only accurate if the learned graph structure and properties remain stable over time. In this paper, we study the problem of employing experience replay to enable continuous graph representation learning in the streaming setting. We propose two online training methods, Random-Based Rehearsal (RBR) and Priority-Based Rehearsal (PBR), which avoid retraining from scratch when changes occur. Our algorithms are the first streaming GNN models capable of scaling to million-edge graphs with low training latency and without compromising accuracy. We evaluate the accuracy and training performance of these experience replay methods on the node classification problem using real-world streaming graphs of various sizes and domains. Our results demonstrate that PBR and RBR achieve orders of magnitude faster training as compared to offline methods while providing high accuracy and resiliency to concept drift. (A replay-buffer sketch follows this record.)
  •  
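The two rehearsal strategies boil down to what the replay buffer keeps: a uniform sample of past training examples versus the highest-priority ones. The sketch below shows both buffer variants in isolation, with invented types; the GNN, the loss and the training loop are left out.

```scala
// Sketch of the experience-replay buffers behind the two rehearsal strategies:
// a random (reservoir-style) buffer and a priority-based one that keeps the
// highest-priority samples. The GNN and the training loop are omitted.
object ReplayBufferSketch {
  final case class Sample(vertexId: Long, priority: Double)

  // Random-based rehearsal: uniform reservoir sampling over the stream.
  final class ReservoirBuffer(capacity: Int, rng: scala.util.Random = new scala.util.Random(42)) {
    private val buf  = scala.collection.mutable.ArrayBuffer.empty[Sample]
    private var seen = 0L
    def add(s: Sample): Unit = {
      seen += 1
      if (buf.size < capacity) buf += s
      else if (rng.nextDouble() < capacity.toDouble / seen)
        buf(rng.nextInt(capacity)) = s                 // replace a random slot with prob capacity/seen
    }
    def replay(n: Int): Seq[Sample] = rng.shuffle(buf.toSeq).take(n)
  }

  // Priority-based rehearsal: evict the lowest-priority sample when full.
  final class PriorityBuffer(capacity: Int) {
    private val buf = scala.collection.mutable.ArrayBuffer.empty[Sample]
    def add(s: Sample): Unit = {
      buf += s
      if (buf.size > capacity) buf -= buf.minBy(_.priority)
    }
    def replay(n: Int): Seq[Sample] = buf.sortBy(s => -s.priority).take(n).toSeq
  }
}
```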
24.
  • Sakr, Sherif, et al. (author)
  • Dagstuhl Seminar on Big Stream Processing
  • 2018
  • In: SIGMOD record. - : ASSOC COMPUTING MACHINERY. - 0163-5808 .- 1943-5835. ; 47:3, s. 36-39
  • Journal article (peer-reviewed)abstract
    • Stream processing can generate insights from big data in real time as it is being produced. This paper reports findings from a 2017 seminar on big stream processing, focusing on applications, systems, and languages.
  •  
25.
  • Spenger, Jonas, et al. (author)
  • A Survey of Actor-Like Programming Models for Serverless Computing
  • 2024
  • In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (LNCS,volume 14360). - : Springer Science and Business Media Deutschland GmbH. ; , s. 123-146
  • Book chapter (other academic/artistic)abstract
    • Serverless computing promises to significantly simplify cloud computing by providing Functions-as-a-Service (FaaS), where invocations of functions, triggered by events, are automatically scheduled for execution on compute nodes. Notably, the serverless computing model does not require the manual provisioning of virtual machines; instead, FaaS enables load-based billing and auto-scaling according to the workload, reducing costs and making scheduling more efficient. While early serverless programming models only supported stateless functions and severely restricted program composition, recently proposed systems offer greater flexibility by adopting ideas from actor and dataflow programming. This paper presents a survey of actor-like programming abstractions for stateful serverless computing, provides a characterization of their properties, and highlights their origin.
  •  
26.
  • Spenger, Jonas, et al. (author)
  • Portals : An Extension of Dataflow Streaming for Stateful Serverless
  • 2022
  • In: Proceedings of the 2022 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. - New York, NY, USA : Association for Computing Machinery. ; , s. 153-171, s. 153-171
  • Conference paper (peer-reviewed)abstract
    • Portals is a serverless, distributed programming model that blends the exactly-once processing guarantees of stateful dataflow streaming frameworks with the message-driven compositionality of actor frameworks. Decentralized applications in Portals can be built dynamically, scale on demand, and always satisfy strict atomic processing guarantees that are natively embedded in the framework’s principal elements of computation, known as atomic streams. In this paper, we describe the capabilities of Portals and demonstrate its use in supporting several popular existing distributed programming paradigms and use-cases. We further introduce all programming model invariants and the corresponding system methods used to satisfy them.
  •  
27.
  • Spenger, Jonas, et al. (author)
  • Portals : A Showcase of Multi-Dataflow Stateful Serverless
  • 2023
  • In: Proceedings of the VLDB Endowment. - : ACM Digital Library. ; , s. 4054-4057, s. 4054-4057
  • Conference paper (peer-reviewed)abstract
    • Serverless applications spanning the cloud and edge require flexible programming frameworks for expressing compositions across the different levels of deployment. Another critical aspect for applications with state is failure resilience beyond the scope of a single dataflow graph that is the current standard in data streaming systems. This paper presents Portals, an interactive, stateful dataflow composition framework with strong end-to-end guarantees. Portals enables event-driven, resilient applications that span across dataflow graphs and serverless deployments. The demonstration exhibits three scenarios in our multi-dataflow streaming-based system: dynamically composing a stateful serverless application; an interactive cloud and edge serverless application; and a Portals browser playground. This work was partially funded by Digital Futures, the Swedish Foundation for Strategic Research under Grant No.: BD15-0006, as well as RISE AI. 
  •  
28.
  • Spenger, Jonas, et al. (author)
  • WIP: PODS : Privacy Compliant Scalable Decentralized Data Services
  • 2021
  • In: Heterogeneous Data Management, Polystores, and Analytics for Healthcare. - Cham : Springer. ; , s. 70-82
  • Conference paper (peer-reviewed)abstract
    • Modern data services need to meet application developers’ demands in terms of scalability and resilience, and also support privacy regulations such as the EU’s GDPR. We outline the main systems challenges of supporting data privacy regulations in the context of large-scale data services, and advocate for causal snapshot consistency to ensure application-level and privacy-level consistency. We present Pods, an extension to the dataflow model that allows external services to access snapshotted operator state directly, with built-in support for addressing the outlined privacy challenges, and summarize open questions and research directions.
  •  
29.
  • Zwolak, Michal, et al. (author)
  • GCNSplit : Bounding the State of Streaming Graph Partitioning
  • 2022
  • In: Proceedings of the 5th International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM 2022 - In conjunction with the 2022 ACM SIGMOD/PODS Conference. - New York, NY, USA : Association for Computing Machinery, Inc.
  • Conference paper (peer-reviewed)abstract
    • This paper introduces GCNSplit, a streaming graph partitioning framework capable of handling unbounded streams with bounded state requirements. We frame partitioning as a classification problem and employ an unsupervised model whose loss function minimizes edge cuts. GCNSplit leverages an inductive graph convolutional network (GCN) to embed graph characteristics into a low-dimensional space and assign edges to partitions in an online manner. We evaluate GCNSplit with real-world graph datasets of various sizes and domains. Our results demonstrate that GCNSplit provides high-throughput, top-quality partitioning, and successfully leverages data parallelism. It achieves a throughput of 430K edges/s on a real-world graph of 1.6B edges using a bounded 147KB-sized model, in contrast to the state-of-the-art HDRF algorithm, which requires > 116GB of in-memory state. With a well-balanced normalized load of 1.01, GCNSplit achieves a replication factor on par with HDRF, showcasing high partitioning quality while storing three orders of magnitude smaller partitioning state. Owing to the power of GCNs, we show that GCNSplit can generalize to entirely unseen graphs while outperforming state-of-the-art stream partitioners in some cases. (A bounded-state classification sketch follows this record.)
  •  
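The bounded-state idea can be pictured as replacing per-vertex partitioning tables with a constant-size model that maps each incoming edge's features to a partition. In the sketch below, hashed endpoint features and a linear scorer stand in for the paper's GCN embeddings and classifier; all names are invented.

```scala
// Sketch of the bounded-state idea: instead of per-vertex partitioning state
// that grows with the graph, a small fixed-size model maps each incoming edge's
// features to a partition. Hashed features and a linear scorer are stand-ins
// for the paper's GCN embeddings and classifier.
object BoundedStatePartitionSketch {
  final case class Edge(src: Long, dst: Long)

  final class ModelPartitioner(k: Int, dim: Int, weights: Array[Array[Double]]) {
    require(weights.length == k && weights.forall(_.length == dim))

    private def features(e: Edge): Array[Double] = {
      val f = new Array[Double](dim)                     // hashed endpoint features
      f((e.src % dim).toInt.abs) += 1.0
      f((e.dst % dim).toInt.abs) += 1.0
      f
    }

    /** Assign an edge using only the constant-size model: state never grows. */
    def assign(e: Edge): Int = {
      val x = features(e)
      (0 until k).maxBy(p => weights(p).zip(x).map { case (w, v) => w * v }.sum)
    }
  }
}
```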