SwePub

Result list for search "WFRF:(Guo Yike) srt2:(2014)"

Search: WFRF:(Guo Yike) > (2014)

  • Results 1-3 of 3
1.
  • Wang, Shicai, et al. (authors)
  • DSIMBench : A benchmark for microarray data using R
  • 2014
  • In: BPOE 2014: Big Data Benchmarks, Performance Optimization, and Emerging Hardware. - Cham : Springer. ; pp. 47-56
  • Conference paper (peer-reviewed). Abstract:
    • Parallel computing in R has been widely used to analyse microarray data. We have seen various applications using various data distribution and calculation approaches. Newer data storage systems, such as MySQL Cluster and HBase, have been proposed for R data storage, while parallel computation frameworks, including MPI and MapReduce, have been applied to R computation. Thus, it is difficult to understand, across whole analysis workflows, which toolkits are suited to a specific environment. In this paper we propose DSIMBench, a benchmark containing two classic microarray analysis functions with eight different parallel R workflows, and evaluate the benchmark in the IC Cloud testbed platform.
2.
  • Wang, Shicai, et al. (authors)
  • High dimensional biological data retrieval optimization with NoSQL technology
  • 2014
  • In: BMC Genomics. - BioMed Central. - ISSN 1471-2164. ; 15:Suppl 8
  • Journal article (peer-reviewed). Abstract:
    • Background: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries to relational databases for hundreds of different patient gene expression records are slow. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results: In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase compared to MongoDB. Conclusions: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
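The contrast drawn in the abstract above, between scanning relational rows and looking records up by key, can be illustrated with a minimal sketch. It is not the authors' schema: the abstract does not state the row-key design, so keying by patient ID is an assumption, the sample records are invented for illustration, and a plain Python dict stands in for HBase.

```python
from collections import defaultdict

# Hypothetical sample records: (patient_id, probe_id, expression_value).
measurements = [
    ("P1", "probeA", 7.1), ("P1", "probeB", 5.4),
    ("P2", "probeA", 6.8), ("P2", "probeB", 5.9),
    ("P3", "probeA", 7.5), ("P3", "probeB", 4.2),
]


def query_relational(rows, patient_ids):
    """Relational-style access: scan every measurement row and filter."""
    wanted = set(patient_ids)
    return [row for row in rows if row[0] in wanted]


def build_key_value_store(rows):
    """Key-value layout (assumed): one entry per patient holding all probes."""
    store = defaultdict(dict)
    for patient_id, probe_id, value in rows:
        store[patient_id][probe_id] = value
    return dict(store)


def query_key_value(store, patient_ids):
    """Key-value access: one direct lookup per requested patient."""
    return {pid: store[pid] for pid in patient_ids if pid in store}


store = build_key_value_store(measurements)
print(query_key_value(store, ["P1", "P3"]))
```

In this toy layout the cost of a query depends on the number of patients requested rather than on the total number of stored measurements, which is the general property a key-value model is expected to offer.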
3.
  • Wang, Shicai, et al. (authors)
  • Optimising parallel R correlation matrix calculations on gene expression data using MapReduce
  • 2014
  • In: BMC Bioinformatics. - BioMed Central. - ISSN 1471-2105. ; 5
  • Journal article (peer-reviewed). Abstract:
    • Background: High-throughput molecular profiling data has been used to improve clinical decision making by stratifying subjects based on their molecular profiles. Unsupervised clustering algorithms can be used for stratification purposes. However, the current speed of the clustering algorithms cannot meet the requirements of large-scale molecular data due to poor performance of the correlation matrix calculation. With high-throughput sequencing technologies promising to produce even larger datasets per subject, we expect the performance of the state-of-the-art statistical algorithms to be further impacted unless efforts towards optimisation are carried out. MapReduce is a widely used high performance parallel framework that can solve the problem. Results: In this paper, we evaluate the current parallel modes for correlation calculation methods and introduce an efficient data distribution and parallel calculation algorithm based on MapReduce to optimise the correlation calculation. We studied the performance of our algorithm using two gene expression benchmarks. In the micro-benchmark, our implementation using MapReduce, based on the R package RHIPE, demonstrates a 3.26-5.83-fold increase compared to the default Snowfall and a 1.56-1.64-fold increase compared to the basic RHIPE in the Euclidean, Pearson and Spearman correlations. Though vanilla R and the optimised Snowfall outperform our optimised RHIPE in the micro-benchmark, they do not scale well with the macro-benchmark. In the macro-benchmark the optimised RHIPE performs 2.03-16.56 times faster than vanilla R. Benefiting from the 3.30-5.13 times faster data preparation, the optimised RHIPE performs 1.22-1.71 times faster than the optimised Snowfall. Both the optimised RHIPE and the optimised Snowfall successfully perform the Kendall correlation with the TCGA dataset within 7 hours. Both run more than 30 times faster than the estimated time for vanilla R. Conclusions: The performance evaluation found that the new MapReduce algorithm and its implementation in RHIPE outperform vanilla R and the conventional parallel algorithms implemented in R Snowfall. We propose that the MapReduce framework holds great promise for large molecular data analysis, in particular for high-dimensional genomic data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new algorithm as a basis for optimising high-throughput molecular data correlation calculation for Big Data.
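The general idea of distributing a correlation matrix calculation as independent map tasks can be sketched as follows. This is a single-process illustration in plain Python, not the paper's RHIPE algorithm: the block-pair partitioning, the function names and the `block_size` parameter are assumptions, and only the Pearson correlation is shown.

```python
from itertools import combinations_with_replacement
from math import sqrt


def standardise(row):
    """Centre a row and scale it to unit length, so Pearson r is a dot product.

    Assumes the row is not constant (a constant row would divide by zero).
    """
    mean = sum(row) / len(row)
    centred = [x - mean for x in row]
    norm = sqrt(sum(c * c for c in centred))
    return [c / norm for c in centred]


def split_blocks(rows, block_size):
    """Data distribution step: cut the rows into (start_index, block) chunks."""
    return [(start, rows[start:start + block_size])
            for start in range(0, len(rows), block_size)]


def map_block_pair(block_a, block_b):
    """Map step: correlate every row of one block with every row of another."""
    (start_a, rows_a), (start_b, rows_b) = block_a, block_b
    for i, row_a in enumerate(rows_a):
        for j, row_b in enumerate(rows_b):
            yield (start_a + i, start_b + j), sum(x * y for x, y in zip(row_a, row_b))


def correlation_matrix(rows, block_size=2):
    """Reduce step: assemble the per-block results into the full matrix."""
    n = len(rows)
    blocks = split_blocks([standardise(r) for r in rows], block_size)
    result = [[0.0] * n for _ in range(n)]
    # Each unordered pair of blocks is an independent task that a
    # MapReduce framework could schedule on a separate worker.
    for block_a, block_b in combinations_with_replacement(blocks, 2):
        for (i, j), r in map_block_pair(block_a, block_b):
            result[i][j] = result[j][i] = r
    return result


genes = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
for line in correlation_matrix(genes):
    print([round(v, 3) for v in line])
```

Because the matrix is symmetric, only unordered block pairs are processed, which roughly halves the work compared with correlating every block against every other block in both directions.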
Publication type
journal article (2)
conference paper (1)
Content type
peer-reviewed (3)
Author/editor
Pandis, Ioannis (3)
Johnson, David (3)
Guo, Yike (3)
Wang, Shicai (3)
Emam, Ibrahim (3)
Guitton, Florian (3)
Oehmichen, Axel (2)
Wu, Chao (1)
He, Sijin (1)
Higher education institution
Uppsala universitet (3)
Language
English (3)
Research subject (UKÄ/SCB)
Natural sciences (3)
