Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models


Conn, Jonathan G.M. (author): University of Strathclyde

Carter, James W. (author): University of Strathclyde

Conn, Justin J.A. (author): University of Strathclyde


Subramanian, Vigneshwari (author): AstraZeneca AB

Baxter, Andrew (author): GlaxoSmithKline

Engkvist, Ola, 1967 (author): Chalmers tekniska högskola, Chalmers University of Technology, AstraZeneca AB

Llinas, Antonio (author): AstraZeneca AB

Ratkova, Ekaterina L. (author): AstraZeneca AB

Pickett, Stephen D. (author): GlaxoSmithKline

McDonagh, James L. (author)

Palmer, David S. (author): University of Strathclyde


2023-02-09
2023
English.
In: Journal of Chemical Information and Modeling. American Chemical Society (ACS). ISSN 1549-960X, 1549-9596. 63:4, pp. 1099-1113

Related links: https://research.cha... (primary) (free); https://research.cha...; https://doi.org/10.1...

Journal article (peer-reviewed)

Abstract


Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that several algorithms are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high-quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
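
The abstract reports model performance as a root-mean-square error (RMSE) in log units. As a minimal sketch of how such a figure is typically computed, the Python snippet below evaluates RMSE over paired observed and predicted log-solubility values. The function name `rmse` and all numbers are hypothetical illustrations; they are not taken from the article or from the challenge data.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the squared residuals, then take the square root.
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log-solubility values, for illustration only.
observed_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-2.5, -4.0, -3.0, -5.0]

error = rmse(observed_log_s, predicted_log_s)
print(f"RMSE = {error:.2f} log units")
```

Because the values are logarithms of solubility, an RMSE of 1.0 log unit corresponds to predictions that are off by roughly a factor of ten in solubility.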

Subject headings

NATURAL SCIENCES -- Computer and Information Sciences -- Other Computer and Information Science (hsv//eng)
NATURAL SCIENCES -- Computer and Information Sciences -- Bioinformatics (hsv//eng)
NATURAL SCIENCES -- Biological Sciences -- Bioinformatics and Systems Biology (hsv//eng)

Keywords

Learning algorithms
Convolutional neural networks
Deep learning
Solubility prediction
Molecules

Publication and content type

art (subject category)
ref (subject category)
