SwePub
Search the SwePub database

  Advanced search

Result list for search "L773:2047 217X OR L773:2047 217X ;pers:(Spjuth Ola)"


  • Results 1-8 of 8
1.
  • Lampa, Samuel, et al. (authors)
  • Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data
  • 2013
  • In: GigaScience. - 2047-217X. ; 2:1, pp. 1-10
  • Journal article (peer-reviewed) abstract:
    • Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX, including large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, and allocating resources, and illustrate major challenges such as managing data growth. We conclude by summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.
  •  
2.
  • Blamey, Ben, et al. (authors)
  • Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit
  • 2021
  • In: GigaScience. - : Oxford University Press. - 2047-217X. ; 10:3, pp. 1-14
  • Journal article (peer-reviewed) abstract:
    • BACKGROUND: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. FINDINGS: In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate it with 2 microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. CONCLUSIONS: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
  •  
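The pipeline model described above, an interestingness function that induces a data hierarchy plus a policy that maps scores to resource decisions, can be sketched generically. This is a conceptual illustration with made-up scores, thresholds, and tier names, not the HASTE Toolkit's actual API.

```python
def interestingness(obj):
    # Hypothetical score: fraction of nonzero pixels in a tiny thumbnail,
    # standing in for any domain-specific measure of interest.
    nonzero = sum(1 for px in obj["pixels"] if px > 0)
    return nonzero / len(obj["pixels"])

def policy(score):
    # Map a score to a storage/compute tier (the "data hierarchy").
    # The thresholds here are illustrative only.
    if score >= 0.75:
        return "tier-1: full analysis, keep raw data"
    if score >= 0.25:
        return "tier-2: keep compressed copy"
    return "tier-3: discard after summary statistics"

def process_stream(stream):
    # Score each incoming object and decide its tier.
    return [(obj["id"], policy(interestingness(obj))) for obj in stream]
```

The point of the separation is that a scientist supplies `interestingness` while the deployment supplies `policy`, so the same scoring logic can drive different resource decisions on-premise versus in the cloud.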
3.
  • Capuccini, Marco, et al. (authors)
  • MaRe : Processing Big Data with application containers on Apache Spark
  • 2020
  • In: GigaScience. - : Oxford University Press. - 2047-217X. ; 9:5
  • Journal article (peer-reviewed) abstract:
    • Background: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. Results: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have attracted the largest open source communities; thus, MaRe provides interoperability with a cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability. Conclusions: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
  •  
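The MapReduce model the abstract builds on can be illustrated with a minimal, framework-free sketch (plain Python, not MaRe's actual Spark or Docker API): a map phase emits key-value pairs per data partition, and a reduce phase aggregates them by key.

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    # Emit (key, 1) pairs, e.g. token counting over a partition of records;
    # in a real framework this runs in parallel, one task per partition.
    return [(token, 1) for record in partition for token in record.split()]

def reduce_phase(pairs):
    # Group by key and sum the values; a real framework shuffles pairs
    # so that all values for one key land on the same reducer.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Two hypothetical partitions of sequence-like records.
partitions = [["ACGT ACGT", "TTGA"], ["ACGT TTGA"]]
mapped = chain.from_iterable(map_phase(p) for p in partitions)
result = reduce_phase(mapped)
```

MaRe's contribution, per the abstract, is letting the map and reduce steps invoke containerized command-line tools instead of in-process functions like these.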
4.
  • Dahlö, Martin, et al. (authors)
  • Tracking the NGS revolution : managing life science research on shared high-performance computing clusters
  • 2018
  • In: GigaScience. - : Oxford University Press. - 2047-217X. ; 7:5
  • Journal article (peer-reviewed) abstract:
    • Background: Next-generation sequencing (NGS) has transformed the life sciences, and many research groups are newly dependent upon computer clusters to store and analyze large datasets. This creates challenges for e-infrastructures accustomed to hosting computationally mature research in other sciences. Using data gathered from our own clusters at UPPMAX computing center at Uppsala University, Sweden, where core hour usage of ∼800 NGS and ∼200 non-NGS projects is now similar, we compare and contrast the growth, administrative burden, and cluster usage of NGS projects with projects from other sciences. Results: The number of NGS projects has grown rapidly since 2010, with growth driven by entry of new research groups. Storage used by NGS projects has grown more rapidly since 2013 and is now limited by disk capacity. NGS users submit nearly twice as many support tickets per user, and 11 more tools are installed each month for NGS projects than for non-NGS projects. We developed usage and efficiency metrics and show that computing jobs for NGS projects use more RAM than non-NGS projects, are more variable in core usage, and rarely span multiple nodes. NGS jobs use booked resources less efficiently for a variety of reasons. Active monitoring can improve this somewhat. Conclusions: Hosting NGS projects imposes a large administrative burden at UPPMAX due to large numbers of inexperienced users and diverse and rapidly evolving research areas. We provide a set of recommendations for e-infrastructures that host NGS research projects. We provide anonymized versions of our storage, job, and efficiency databases.
  •  
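An efficiency metric of the kind the abstract mentions, comparing booked against actually used core-hours, might look like the following sketch. The field names and the exact definition here are hypothetical, not the paper's schema.

```python
def booking_efficiency(jobs):
    """Fraction of booked core-hours that were actually used.

    `jobs` is a list of dicts with hypothetical fields:
    cores_booked, cores_used, and hours (wall-clock runtime).
    """
    booked = sum(j["cores_booked"] * j["hours"] for j in jobs)
    used = sum(j["cores_used"] * j["hours"] for j in jobs)
    # Avoid division by zero for an empty or zero-booking job list.
    return used / booked if booked else 0.0
```

A project that books 16-core nodes but runs single-threaded tools would score low on such a metric, which is the kind of pattern the paper attributes to NGS workloads.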
5.
  • Lampa, Samuel, et al. (authors)
  • SciPipe : A workflow library for agile development of complex and dynamic bioinformatics pipelines
  • 2019
  • In: GigaScience. - : Oxford University Press (OUP). - 2047-217X. ; 8:5
  • Journal article (peer-reviewed) abstract:
    • Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. Findings: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline. Conclusions: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.
  •  
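SciPipe itself is a Go library; to keep all examples in this listing in one language, the core idea it builds on, self-contained components executed in dependency order, can be sketched in Python. This is a conceptual illustration with hypothetical task names, not SciPipe's API, and it omits the cycle detection and audit logging a real system needs.

```python
def run_workflow(tasks, deps):
    """Run tasks in dependency order via depth-first traversal.

    tasks: mapping of task name -> zero-argument callable.
    deps:  mapping of task name -> list of prerequisite task names.
    Returns the order in which tasks were executed.
    """
    done, order = set(), []

    def visit(name):
        if name in done:
            return
        for prerequisite in deps.get(name, []):
            visit(prerequisite)  # ensure upstream tasks run first
        done.add(name)
        order.append(name)
        tasks[name]()

    for name in tasks:
        visit(name)
    return order
```

In flow-based terms, each callable stands in for a reusable component whose outputs feed the next component; running only a subset of tasks, as SciPipe supports, corresponds to calling `visit` on just the targets you need.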
6.
  •  
7.
  • Siretskiy, Alexey, et al. (authors)
  • A quantitative assessment of the Hadoop framework for analyzing massively parallel DNA sequencing data
  • 2015
  • In: GigaScience. - : Oxford University Press (OUP). - 2047-217X. ; 4
  • Journal article (peer-reviewed) abstract:
    • Background: New high-throughput technologies, such as massively parallel sequencing, have transformed the life sciences into a data-intensive field. The most common e-infrastructure for analyzing this data consists of batch systems that are based on high-performance computing resources; however, the bioinformatics software that is built on this platform does not scale well in the general case. Recently, the Hadoop platform has emerged as an interesting option to address the challenges of increasingly large datasets with distributed storage, distributed processing, built-in data locality, fault tolerance, and an appealing programming methodology. Results: In this work we introduce metrics and report on a quantitative comparison between Hadoop and a single node of conventional high-performance computing resources for the tasks of short read mapping and variant calling. We calculate efficiency as a function of data size and observe that the Hadoop platform is more efficient for biologically relevant data sizes in terms of computing hours for both split and un-split data files. We also quantify the advantages of the data locality provided by Hadoop for NGS problems, and show that a classical architecture with network-attached storage will not scale when computing resources increase in number. Measurements were performed using ten datasets of different sizes, up to 100 gigabases, using the pipeline implemented in Crossbow. To make a fair comparison, we implemented an improved preprocessor for Hadoop with better performance for splittable data files. For improved usability, we implemented a graphical user interface for Crossbow in a private cloud environment using the CloudGene platform. All of the code and data in this study are freely available as open source in public repositories. Conclusions: From our experiments we can conclude that the improved Hadoop pipeline scales better than the same pipeline on high-performance computing resources; we also conclude that Hadoop is an economically viable option for the common data sizes that are currently used in massively parallel sequencing. Given that datasets are expected to increase over time, Hadoop is a framework that we envision will have an increasingly important role in future biological data analysis.
  •  
8.
  • Spjuth, Ola, et al. (authors)
  • Recommendations on e-infrastructures for next-generation sequencing
  • 2016
  • In: GigaScience. - : Oxford University Press (OUP). - 2047-217X. ; 5
  • Research review (peer-reviewed) abstract:
    • With ever-increasing amounts of data being produced by next-generation sequencing (NGS) experiments, the requirements placed on supporting e-infrastructures have grown. In this work, we provide recommendations based on the collective experiences from participants in the EU COST Action SeqAhead for the tasks of data preprocessing, upstream processing, data delivery, and downstream analysis, as well as long-term storage and archiving. We cover demands on computational and storage resources, networks, software stacks, automation of analysis, education, and also discuss emerging trends in the field. E-infrastructures for NGS require substantial effort to set up and maintain over time, and with sequencing technologies and best practices for data analysis evolving rapidly it is important to prioritize both processing capacity and e-infrastructure flexibility when making strategic decisions to support the data analysis demands of tomorrow. Due to increasingly demanding technical requirements we recommend that e-infrastructure development and maintenance be handled by a professional service unit, be it internal or external to the organization, and emphasis should be placed on collaboration between researchers and IT professionals.
  •  
Publication type
journal article (7)
research review (1)
Content type
peer-reviewed (8)
Author/editor
Dahlö, Martin (6)
Lampa, Samuel (3)
Capuccini, Marco (2)
Toor, Salman (2)
Spjuth, Ola, Docent, ... (2)
Spjuth, Ola, Docent (2)
Larsson, Anders (1)
Hellander, Andreas (1)
Emami Khoonsari, Pay ... (1)
Kultima, Kim (1)
Hankemeier, Thomas (1)
Schaal, Wesley, PhD (1)
Alvarsson, Jonathan, ... (1)
Vezzi, Francesco (1)
Dahlberg, Johan (1)
Sabirsh, Alan (1)
Ólason, Páll I. (1)
Bongcam Rudloff, Eri ... (1)
Neumann, Steffen (1)
Spjuth, Ola, 1977- (1)
Spjuth, Ola, Profess ... (1)
O'Donovan, Claire (1)
Hagberg, Jonas (1)
Wählby, Carolina, pr ... (1)
Ebbels, Timothy M D (1)
Glen, Robert (1)
Salek, Reza M (1)
Kale, Namrata (1)
Haug, Kenneth (1)
Schober, Daniel (1)
Rocca-Serra, Philipp ... (1)
Steinbeck, Christoph (1)
de Atauri, Pedro (1)
Cascante, Marta (1)
Zanetti, Gianluigi (1)
Scofield, Douglas (1)
Harrison, Philip J (1)
Sintorn, Ida-Maria, ... (1)
Wieslander, Håkan (1)
Pearce, Jake T. M. (1)
Blamey, Ben (1)
Bergmann, Sven (1)
Novella, Jon Ander (1)
Sadawi, Noureddin (1)
Herman, Stephanie (1)
Rueedi, Rico (1)
Karaman, Ibrahim (1)
Johnson, David (1)
Siretskiy, Alexey (1)
Higher education institution
Uppsala universitet (8)
Stockholms universitet (2)
Malmö universitet (1)
Sveriges Lantbruksuniversitet (1)
Language
English (8)
Research subject (UKÄ/SCB)
Natural sciences (7)
Engineering and technology (1)
Medicine and health sciences (1)

