SwePub
Result list for search "WFRF:(Niazi Salman)"

  • Results 1-10 of 18
1.
  • Bessani, A., et al. (author)
  • BiobankCloud : A platform for the secure storage, sharing, and processing of large biomedical data sets
  • 2016
  • In: 1st International Workshop on Data Management and Analytics for Medicine and Healthcare, DMAH 2015 and Workshop on Big-Graphs Online Querying, Big-O(Q) 2015 held in conjunction with 41st International Conference on Very Large Data Bases, VLDB 2015. - Cham: Springer. - ISBN 9783319415758, 9783319415765; pp. 89-105
  • Conference paper (peer-reviewed). Abstract:
    • Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are not scalable and secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language focusing on simple integration and chaining of existing tools, adaptive scheduling on Apache Yarn, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind a secure 2-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open-source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.
3.
  • Chikafa, Gibson, 1993-, et al. (author)
  • Cloud-native RStudio on Kubernetes for Hopsworks
  • 2023
  • Other publication (other academic/artistic). Abstract:
    • In order to fully benefit from cloud computing, services are designed following the “multi-tenant” architectural model, which is aimed at maximizing resource sharing among users. However, multi-tenancy introduces challenges of security, performance isolation, scaling, and customization. RStudio server is an open-source Integrated Development Environment (IDE) accessible over a web browser for the R programming language. We present the design and implementation of a multi-user distributed system on Hopsworks, a data-intensive AI platform, following the multi-tenant model that provides RStudio as Software as a Service (SaaS). We use the most popular cloud-native technologies: Docker and Kubernetes, to solve the problems of performance isolation, security, and scaling that are present in a multi-tenant environment. We further enable secure data sharing in RStudio server instances to provide data privacy and allow collaboration among RStudio users. We integrate our system with Apache Spark, which can scale and handle Big Data processing workloads. Also, we provide a UI where users can provide custom configurations and have full control of their own RStudio server instances. Our system was tested on a Google Cloud Platform cluster with four worker nodes, each with 30GB of RAM allocated to them. The tests on this cluster showed that 44 RStudio servers, each with 2GB of RAM, can be run concurrently. Our system can scale out to potentially support hundreds of concurrently running RStudio servers by adding more resources (CPUs and RAM) to the cluster or system.
4.
  • de la Rua Martinez, Javier, et al. (author)
  • The Hopsworks Feature Store for Machine Learning
  • 2024
  • In: SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data. - Association for Computing Machinery (ACM); pp. 135-147
  • Conference paper (peer-reviewed). Abstract:
    • Data management is the most challenging aspect of building Machine Learning (ML) systems. ML systems can read large volumes of historical data when training models, but inference workloads are more varied, depending on whether it is a batch or online ML system. The feature store for ML has recently emerged as a single data platform for managing ML data throughout the ML lifecycle, from feature engineering to model training to inference. In this paper, we present the Hopsworks feature store for machine learning as a highly available platform for managing feature data with API support for columnar, row-oriented, and similarity search query workloads. We introduce and address challenges solved by the feature stores related to feature reuse, how to organize data transformations, and how to ensure correct and consistent data between feature engineering, model training, and model inference. We present the engineering challenges in building high-performance query services for a feature store and show how Hopsworks outperforms existing cloud feature stores for training and online inference query workloads.
5.
  • Gholami, Ali, et al. (author)
  • Privacy-Preservation for Publishing Sample Availability Data with Personal Identifiers
  • 2015
  • In: Journal of Medical and Bioengineering. - EJournal Publishing. - ISSN 2301-3796; 4:2, pp. 117-125
  • Journal article (peer-reviewed). Abstract:
    • Medical organizations collect, store and process vast amounts of sensitive information about patients. Easy access to this information by researchers is crucial to improving medical research, but in many institutions, cumbersome security measures and walled gardens have created a situation where even information about what medical data is out there is not available. One of the main security challenges in this area is enabling researchers to cross-link different medical studies, while preserving the privacy of the patients involved. In this paper, we introduce a privacy-preserving system for publishing sample availability data that allows researchers to make queries that crosscut different studies. That is, researchers can ask questions such as how many patients have had both diabetes and prostate cancer, where the diabetes and prostate cancer information originates from different clinical registries. We realize our solution by having a two-level anonymization mechanism, where our toolkit for publishing availability data first pseudonymizes personal identifiers and then anonymizes sensitive attributes. Our toolkit also includes a web-based server that stores the encrypted pseudonymized sample data and allows researchers to execute cross-linked queries across different study data. We believe that our toolkit contributes a first step to support the privacy-preserving publication of data containing personal identifiers.
6.
  • Ismail, Mahmoud, et al. (author)
  • Distributed Hierarchical File Systems strike back in the Cloud
  • 2020
  • In: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). - Institute of Electrical and Electronics Engineers (IEEE); pp. 820-830
  • Conference paper (peer-reviewed). Abstract:
    • Cloud service providers have aligned on availability zones as an important unit of failure and replication for storage systems. An availability zone (AZ) has independent power, networking, and cooling systems and consists of one or more data centers. Multiple AZs in close geographic proximity form a region that can support replicated low latency storage services that can survive the failure of one or more AZs. Recent reductions in inter-AZ latency have made synchronous replication protocols increasingly viable, instead of traditional quorum-based replication protocols. We introduce HopsFS-CL, a distributed hierarchical file system with support for high-availability (HA) across AZs, backed by AZ-aware synchronously replicated metadata and AZ-aware block replication. HopsFS-CL is a redesign of HopsFS, a version of HDFS with distributed metadata, and its design involved making replication protocols and block placement protocols AZ-aware at all layers of its stack: the metadata serving, the metadata storage, and block storage layers. In experiments on a real-world workload from Spotify, we show that HopsFS-CL, deployed in HA mode over 3 AZs, reaches 1.66 million ops/s, and has similar performance to HopsFS when deployed in a single AZ, while preserving the same semantics.
7.
  • Ismail, Mahmoud, et al. (author)
  • HopsFS-S3 : Extending Object Stores with POSIX-like Semantics and more (industry track)
  • 2020
  • In: Proceedings of the 2020 21st International Middleware Conference Industrial Track (Middleware industry '20). - New York, NY, USA: Association for Computing Machinery (ACM); pp. 23-30
  • Conference paper (peer-reviewed). Abstract:
    • Object stores have become the de-facto platform for storage in the cloud due to their scalability, high availability, and low cost. However, they provide weaker metadata semantics and lower performance compared to distributed hierarchical file systems. In this paper, we introduce HopsFS-S3, a hybrid distributed hierarchical file system backed by an object store while preserving the file system's strong consistency semantics. We base our implementation on HopsFS, a next-generation distribution of HDFS with distributed metadata. We redesigned HopsFS' block storage layer to transparently use an object store to store the file's blocks without sacrificing the file system's semantics. We also introduced a new block caching service to leverage faster NVMe storage for hot blocks. In our experiments, we show that HopsFS-S3 outperforms EMRFS for IO-bound workloads, with up to 20% higher performance and delivers up to 3.4X the aggregated read throughput of EMRFS. Moreover, we demonstrate that metadata operations on HopsFS-S3 (such as directory rename) are up to two orders of magnitude faster than EMRFS. Finally, HopsFS-S3 opens up the currently closed metadata in object stores, enabling correctly-ordered change notifications with HopsFS' change data capture (CDC) API and customized extensions to metadata.
8.
  • Ismail, Mahmoud, et al. (author)
  • Scalable Block Reporting for HopsFS
  • 2019
  • In: 2019 IEEE International Congress on Big Data (BigData Congress). - ISBN 9781728127712; pp. 157-164
  • Conference paper (peer-reviewed). Abstract:
    • Distributed hierarchical file systems typically decouple the storage of the file system’s metadata from the data (file system blocks) to enable the scalability of the file system. This decoupling, however, requires the introduction of a periodic synchronization protocol to ensure the consistency of the file system’s metadata and its blocks. Apache HDFS and HopsFS implement a protocol, called block reporting, where each data server periodically sends ground truth information about all its file system blocks to the metadata servers, allowing the metadata to be synchronized with the actual state of the data blocks in the file system. The network and processing overhead of the existing block reporting protocol, however, increases with cluster size, ultimately limiting cluster scalability. In this paper, we introduce a new block reporting protocol for HopsFS that reduces the protocol bandwidth and processing overhead by up to three orders of magnitude, compared to HDFS/HopsFS’ existing protocol. Our new protocol removes a major bottleneck that prevented HopsFS clusters from scaling to tens of thousands of servers.
9.
  • Ismail, Mahmoud, et al. (author)
  • Scaling HDFS to more than 1 million operations per second with HopsFS
  • 2017
  • In: Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017. - Institute of Electrical and Electronics Engineers Inc. - ISBN 9781509066100; pp. 683-688
  • Conference paper (peer-reviewed). Abstract:
    • HopsFS is an open-source, next-generation distribution of the Apache Hadoop Distributed File System (HDFS) that replaces the main scalability bottleneck in HDFS, the single-node in-memory metadata service, with a no-shared-state distributed system built on a NewSQL database. By removing the metadata bottleneck in Apache HDFS, HopsFS enables significantly larger cluster sizes, more than an order of magnitude higher throughput, and significantly lower client latencies for large clusters. In this paper, we detail the techniques and optimizations that enable HopsFS to surpass 1 million file system operations per second, at least 16 times higher throughput than HDFS. In particular, we discuss how we exploit recent high-performance features from NewSQL databases, such as application-defined partitioning, partition-pruned index scans, and distribution-aware transactions. Together with more traditional techniques, such as batching and write-ahead caches, we show how many incremental optimizations have enabled a revolution in distributed hierarchical file system performance.