data science practitioner
machine learning researcher
Summary: We developed hdnom, an R package for survival modeling with high-dimensional data. The package is the first free and open-source software package that streamlines the workflow of penalized Cox model building, validation, calibration, comparison, and nomogram visualization, with nine types of penalized Cox regression methods fully supported. A web application and an online prediction tool maker are offered to enhance interactivity and flexibility in high-dimensional survival analysis.
Availability: The hdnom R package is available from CRAN: https://cran.r-project.org/package=hdnom under GPL. The hdnom web application can be accessed at http://hdnom.io. The web application maker is available from https://hdnom.org/appmaker. The hdnom project website: https://hdnom.org.
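For readers less familiar with penalized Cox regression, the objective such models optimize can be sketched as follows. The notation is an assumption introduced here, not taken from the hdnom documentation: $x_i$ is the covariate vector of subject $i$, $\delta_i$ the event indicator, $R(t_i)$ the set of subjects still at risk at time $t_i$, $\lambda \ge 0$ the penalty strength, and $P(\beta)$ the penalty function. Ties are ignored and scaling conventions differ between implementations.

```latex
\ell(\beta) = \sum_{i:\,\delta_i = 1} \Big[ x_i^\top \beta
  - \log \sum_{j \in R(t_i)} \exp\big(x_j^\top \beta\big) \Big],
\qquad
\hat{\beta} = \arg\max_{\beta} \big\{ \ell(\beta) - \lambda\, P(\beta) \big\}

% Two common choices of penalty:
P_{\text{lasso}}(\beta) = \|\beta\|_1,
\qquad
P_{\text{enet}}(\beta) = \alpha \|\beta\|_1 + \tfrac{1 - \alpha}{2} \|\beta\|_2^2,
\quad \alpha \in [0, 1]
```

The lasso and elastic-net penalties are shown only as two familiar examples; the other penalty types the package supports replace $P(\beta)$ with a different function.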
Li-Li Wang, You-Wu Lin, Xu-Fei Wang,
Abstract: Dimension reduction and variable selection are two effective classes of methods for dealing with high-dimensional data. Variable selection techniques in particular are in widespread use and fall broadly into individual selection methods and interval selection methods. Because vibrational spectra have continuous spectral bands, selecting intervals rather than individual wavelength points yields more stable models and easier interpretation. Numerous interval selection methods have been proposed recently. This paper therefore presents a selective review of interval selection methods with partial least squares (PLS) as the calibration model. We describe the algorithms in five classes: classic, penalty-based, sampling-based, correlation-based, and projection-based methods. Finally, we compare and discuss the performance of a subset of these methods on three real-world spectroscopic datasets.
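To make the idea of interval selection concrete, here is a minimal sketch of the partitioning step that interval-based approaches in the style of interval PLS start from: the wavelength axis is cut into contiguous blocks, and each block is then evaluated as a unit. The function name and the example sizes (700 wavelength points, 20 intervals) are hypothetical, and the model-fitting step is deliberately left out.

```python
def split_into_intervals(n_points, n_intervals):
    """Split indices 0..n_points-1 into contiguous, near-equal intervals.

    Returns a list of range objects. When n_points is not divisible by
    n_intervals, the first few intervals are one point wider.
    """
    if not 1 <= n_intervals <= n_points:
        raise ValueError("n_intervals must be between 1 and n_points")
    base, extra = divmod(n_points, n_intervals)
    intervals = []
    start = 0
    for i in range(n_intervals):
        width = base + (1 if i < extra else 0)
        intervals.append(range(start, start + width))
        start += width
    return intervals


# Hypothetical spectrum with 700 wavelength points cut into 20 intervals.
intervals = split_into_intervals(700, 20)
uneven = split_into_intervals(10, 3)  # widths 4, 3, 3
```

An interval selection method would then fit a calibration model on the columns of each interval (or on combinations of intervals) and keep those with the lowest cross-validated error.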
Abstract: In high-dimensional data modeling, variable selection methods have been a popular choice to improve the prediction accuracy by effectively selecting the subset of informative variables, and such methods can enhance the model interpretability with sparse representation. In this study, we propose a novel group variable selection method named ordered homogeneity pursuit lasso (OHPL) that takes the homogeneity structure in high-dimensional data into account. OHPL is particularly useful in high-dimensional datasets with strongly correlated variables. We illustrate the approach using three real-world spectroscopic datasets and compare it with four state-of-the-art variable selection methods. The benchmark results on real-world data show that the proposed method is capable of identifying a small number of influential groups and has better prediction performance than its competitors. The OHPL method and the spectroscopic datasets are implemented and included in an R package OHPL available from https://OHPL.io.
L. Shen, D.-S. Cao, Q.-S. Xu, X. Huang,
Abstract: Regression and variable selection in high-dimensional settings, especially when p >> n, has been a popular research topic in statistical machine learning. In recent years, many successful methods have been developed to tackle this problem. In this paper, we propose the multi-step adaptive elastic-net (MSA-Enet), a multi-step estimation algorithm built upon adaptive elastic-net regularization. Numerical studies on simulated data and real-world biological datasets show that the MSA-Enet method tends to significantly reduce the number of false-positive variables while still maintaining estimation accuracy. By analyzing the variables eliminated in each step, more insight can be gained into the structure of the correlated variable groups. These properties are desirable in many real-world variable selection and regression problems.
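As a rough illustration of what a multi-step adaptive elastic-net looks like, the sketch below writes one reweighted step in generic notation. This is an assumption-laden outline, not the exact formulation from the paper: the symbols $\lambda_1$, $\lambda_2$, $\gamma$, the step index $k$, and the precise form of the weights are all introduced here for illustration, and rescaling conventions are omitted.

```latex
% Step k: elastic-net fit with coefficient-specific L1 weights
\hat{\beta}^{(k)} = \arg\min_{\beta}\;
  \| y - X\beta \|_2^2
  + \lambda_2 \|\beta\|_2^2
  + \lambda_1 \sum_{j=1}^{p} w_j^{(k)} |\beta_j|

% Weights come from the previous step's estimate (gamma > 0)
w_j^{(k)} = \big| \hat{\beta}_j^{(k-1)} \big|^{-\gamma},
\qquad
\hat{\beta}^{(0)} = \text{an initial elastic-net estimate}
```

Under this reading, a variable whose coefficient is shrunk to zero at one step receives an unbounded weight at the next and is effectively dropped, which is why the set of variables eliminated at each step can be inspected.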
Abstract: Amino acid sequence-derived structural and physicochemical descriptors are extensively used in research on the structural, functional, expression, and interaction profiles of proteins and peptides. We developed protr, a comprehensive R package for generating various numerical representation schemes of proteins and peptides from amino acid sequences. The package calculates eight descriptor groups composed of twenty-two types of commonly used descriptors that include about 22,700 descriptor values. It allows users to select amino acid properties from the AAindex database and to use self-defined properties to construct customized descriptors. For proteochemometric modeling, it calculates six types of scales-based descriptors derived by various dimensionality reduction methods. The protr package also integrates similarity score computation based on protein sequence alignment and Gene Ontology (GO) semantic similarity measures within a list of proteins, and calculates profile-based protein features based on the position-specific scoring matrix (PSSM). We also developed ProtrWeb, a user-friendly web server for calculating the descriptors presented in the protr package. The protr package is freely available from CRAN. ProtrWeb is freely available at protr.org.
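To show what a sequence-derived descriptor is in practice, the following sketch computes amino acid composition, one of the simplest descriptors of this kind: the fraction of each of the 20 standard residues in a sequence. It is written in plain Python for illustration and is not the protr API; the function name and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a dict mapping each standard amino acid to its frequency."""
    seq = sequence.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if not seq or unknown:
        raise ValueError("sequence must be non-empty and use standard residues")
    counts = Counter(seq)
    # Counter returns 0 for residues that never occur in the sequence.
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


# Hypothetical short peptide: a 20-dimensional descriptor vector.
descriptor = amino_acid_composition("MKTAYIAKQR")
```

Each protein is thereby mapped to a fixed-length numeric vector regardless of its sequence length, which is what makes such descriptors usable as input to standard machine learning methods.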
Abstract: In chemoinformatics and bioinformatics, one of the main computational challenges in predictive modeling is finding a suitable way to represent the molecules under investigation, such as small molecules, proteins, and even complex interactions. To address this problem, we developed a freely available R/Bioconductor package, called Compound-Protein Interaction with R (Rcpi), for complex molecular representation of drugs, proteins, and more complex interactions including protein-protein and compound-protein interactions. Rcpi calculates a large number of structural and physicochemical features of proteins and peptides from amino acid sequences, molecular descriptors of small molecules from their topology, and protein-protein interaction and compound-protein interaction descriptors. In addition to its main functionality, Rcpi provides a number of auxiliary utilities to meet users' needs. With the descriptors calculated by this package, users can conveniently apply various statistical machine learning methods in R to address research questions in computational biology and drug discovery. Rcpi is freely available from the Bioconductor website.
Abstract: Identifying potential adverse drug reactions (ADRs) is critically important for drug discovery and public health. Here we developed a multiple evidence fusion (MEF) method for the large-scale prediction of drug ADRs that can handle both approved drugs and novel molecules. MEF is based on the similarity reference by collaborative filtering, and integrates multiple similarity measures from various data types, taking advantage of the complementarity in the data. We used MEF to integrate drug-related and ADR-related data from multiple levels, including the network structural data formed by known drug–ADR relationships, for predicting likely unknown ADRs. In cross-validation, it obtains high sensitivity and specificity, substantially outperforming existing methods that use a single data type or only a few. We validated our predictions by their overlap with drug–ADR associations that are known in databases. The proposed computational method could be used for complementary hypothesis generation and rapid analysis of potential drug–ADR interactions.
J.-B. Wang, D.-S. Cao, M.-F. Zhu, Y.-H. Yun,
Abstract: Lipophilicity, evaluated by either the n-octanol/water partition coefficient (logP) or the n-octanol/buffer solution distribution coefficient (logD), is of high importance in pharmacology, toxicology, and medicinal chemistry. A quantitative structure-property relationship (QSPR) study was carried out to predict distribution coefficients at pH 7.4 (logD7.4) for a large data set of 1,130 organic compounds, using 30 molecular descriptors selected by a genetic algorithm (GA). Partial least squares (PLS) and support vector machine (SVM) regression were used to build prediction models with 904 molecules as the training set, and predictive ability was evaluated with 226 molecules as the external test set. The regression statistics demonstrate that the SVM model is more reliable and has better predictive accuracy than the PLS model. The squared correlation coefficients of fitting, cross-validation, and prediction are 0.92, 0.90, and 0.90, respectively. The corresponding root mean square errors are 0.52, 0.59, and 0.56, respectively. The reliability and generalization ability of the model were assessed by applicability domain analysis and a Y-randomization test, and the 30 selected molecular descriptors give, to some extent, a reliable and direct interpretation of logD7.4. Our SVM model outperforms the logD7.4 values calculated by five methods from Discovery Studio and ChemAxon. The results indicate that our model is a reliable and promising method for evaluating logD7.4.
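The two statistics reported above can be computed as in the sketch below. It assumes the squared correlation coefficient is the squared Pearson correlation between observed and predicted values; some QSPR studies instead report 1 - SSE/SST, and the abstract does not say which was used. The function names and the small example arrays are hypothetical; only the 904/226 split sizes come from the abstract.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Split sizes from the abstract: 904 training + 226 external test molecules.
n_train, n_test = 904, 226
test_fraction = n_test / (n_train + n_test)  # an 80/20 split

# Hypothetical observed and predicted logD7.4 values, for illustration only.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
example_rmse = rmse(observed, predicted)
example_r2 = squared_correlation(observed, predicted)
```

Reporting both statistics on the training fit, on cross-validation, and on the external test set, as the abstract does, is what allows overfitting to be detected: a large gap between fitting and prediction values would indicate a model that does not generalize.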