Nan Xiao


R developer
Data science practitioner
Machine learning researcher

Preprints

Nan Xiao, Q.-S. Xu, and M.-Z. Li. hdnom: Building Nomograms for Penalized Cox Models with High-Dimensional Survival Data. bioRxiv. doi: 10.1101/065524. abstract | software

2017

Y.-W. Lin, Nan Xiao, L.-L. Wang, C.-Q. Li, Q.-S. Xu. Ordered homogeneity pursuit lasso for group variable selection with applications to spectroscopic data. Chemometrics and Intelligent Laboratory Systems 168: 62-71, 2017. doi: 10.1016/j.chemolab.2017.07.004. abstract | website | software

2016

L. Shen, D.-S. Cao, Q.-S. Xu, X. Huang, Nan Xiao, Y.-Z. Liang. A novel local manifold-ranking based k-NN for modeling the regression between bioactivity and molecular descriptors. Chemometrics and Intelligent Laboratory Systems 151: 71-77, 2016. doi: 10.1016/j.chemolab.2015.12.005. abstract

2015

Nan Xiao and Q.-S. Xu. Multi-step adaptive elastic-net: reducing false positives in high-dimensional variable selection. Journal of Statistical Computation and Simulation 85(18): 3755-3765, 2015. doi: 10.1080/00949655.2015.1016944. abstract | software

Nan Xiao, D.-S. Cao, M.-F. Zhu, and Q.-S. Xu. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequence. Bioinformatics 31(11): 1857-1859, 2015. doi: 10.1093/bioinformatics/btv042. abstract | software

D.-S. Cao*, Nan Xiao*, Q.-S. Xu and A. F. Chen. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds, and their interactions. Bioinformatics 31(2): 279-281, 2015. doi: 10.1093/bioinformatics/btu624. *Joint first authors. abstract | software

D.-S. Cao, Nan Xiao, Y.-J. Li, W.-B. Zeng, Y.-Z. Liang, A.-P. Lu, Q.-S. Xu, A. F. Chen. Integrating multiple evidence sources to predict adverse drug reactions based on systems pharmacology model. CPT: Pharmacometrics & Systems Pharmacology 4(9): 498–506, 2015. doi: 10.1002/psp4.12002. abstract | code and data to reproduce results from paper

J.-B. Wang, D.-S. Cao, M.-F. Zhu, Y.-H. Yun, Nan Xiao, Y.-Z. Liang. In silico evaluation of logD7.4 and comparison with other prediction methods. Journal of Chemometrics 29(7): 389-398, 2015. doi: 10.1002/cem.2718. abstract | dataset

Abstracts

hdnom: Building Nomograms for Penalized Cox Models with High-Dimensional Survival Data

Nan Xiao, Q.-S. Xu, and M.-Z. Li.

hdnom

Abstract:

Summary: We developed hdnom, an R package for survival modeling with high-dimensional data. The package is the first free and open-source software package that streamlines the workflow of penalized Cox model building, validation, calibration, comparison, and nomogram visualization, with nine types of penalized Cox regression methods fully supported. A web application and an online prediction tool maker are offered to enhance interactivity and flexibility in high-dimensional survival analysis.

Availability: The hdnom R package is available from CRAN: https://cran.r-project.org/package=hdnom under GPL. The hdnom web application can be accessed at http://hdnom.io. The web application maker is available from https://hdnom.org/appmaker. The hdnom project website: https://hdnom.org.

bioRxiv (2016).
Ordered Homogeneity Pursuit Lasso for Group Variable Selection with Applications to Spectroscopic Data

You-Wu Lin, Nan Xiao, Li-Li Wang, Chuan-Quan Li, Qing-Song Xu.

OHPL flowchart

Abstract: In high-dimensional data modeling, variable selection methods have been a popular choice to improve the prediction accuracy by effectively selecting the subset of informative variables, and such methods can enhance the model interpretability with sparse representation. In this study, we propose a novel group variable selection method named ordered homogeneity pursuit lasso (OHPL) that takes the homogeneity structure in high-dimensional data into account. OHPL is particularly useful in high-dimensional datasets with strongly correlated variables. We illustrate the approach using three real-world spectroscopic datasets and compare it with four state-of-the-art variable selection methods. The benchmark results on real-world data show that the proposed method is capable of identifying a small number of influential groups and has better prediction performance than its competitors. The OHPL method and the spectroscopic datasets are implemented and included in an R package OHPL available from https://OHPL.io.

Chemometrics and Intelligent Laboratory Systems (2017).
A novel local manifold-ranking based k-NN for modeling the regression between bioactivity and molecular descriptors

L. Shen, D.-S. Cao, Q.-S. Xu, X. Huang, Nan Xiao, Y.-Z. Liang.

MRKNN

Chemometrics and Intelligent Laboratory Systems (2016).
Multi-Step Adaptive Elastic-Net: Reducing false positives in high-dimensional variable selection

Nan Xiao and Q.-S. Xu.

msaenet

Abstract: Regression and variable selection in high-dimensional settings, especially when p >> n, has been a popular research topic in statistical machine learning. In recent years, many successful methods have been developed to tackle this problem. In this paper, we propose the multi-step adaptive elastic-net (MSA-Enet), a multi-step estimation algorithm built upon adaptive elastic-net regularization. Numerical studies on simulated data and real-world biological datasets show that the MSA-Enet method tends to significantly reduce the number of false-positive variables while still maintaining estimation accuracy. By analyzing the variables eliminated in each step, more insight could be gained about the structure of the correlated variable groups. These properties are desirable in many real-world variable selection and regression problems.

Journal of Statistical Computation and Simulation (2015).
protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequence

Nan Xiao, D.-S. Cao, M.-F. Zhu, and Q.-S. Xu.

protr schematic

Abstract: Amino acid sequence-derived structural and physicochemical descriptors are extensively utilized for the research of structural, functional, expression and interaction profiles of proteins and peptides. We developed protr, a comprehensive R package for generating various numerical representation schemes of proteins and peptides from amino acid sequences. The package calculates eight descriptor groups composed of twenty-two types of commonly used descriptors that include about 22,700 descriptor values. It allows users to select amino acid properties from the AAindex database, and use self-defined properties to construct customized descriptors. For proteochemometric modeling, it calculates six types of scales-based descriptors derived by various dimensionality reduction methods. The protr package also integrates the functionality of similarity score computation derived by protein sequence alignment and Gene Ontology (GO) semantic similarity measures within a list of proteins, and calculates profile-based protein features based on position-specific scoring matrix (PSSM). We also developed ProtrWeb, a user-friendly web server for calculating descriptors presented in the protr package. The protr package is freely available from CRAN. ProtrWeb is freely available at protr.org.

Bioinformatics (2015).
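
To make the idea of a numerical representation of a protein sequence concrete, the sketch below computes amino acid composition: the fraction of each of the 20 standard amino acids in a sequence, which is among the simplest sequence-derived descriptors. It is written in Python for illustration only and is not the protr implementation; the function name `amino_acid_composition` and the toy peptide are hypothetical.

```python
from collections import Counter
from typing import Dict

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in `sequence`."""
    seq = sequence.upper()
    unknown = set(seq) - set(AMINO_ACIDS)
    if not seq or unknown:
        # Reject empty input and residues outside the 20 standard codes.
        raise ValueError(
            f"empty sequence or non-standard residues: {sorted(unknown)}"
        )
    counts = Counter(seq)
    # Counter returns 0 for residues that never occur in the sequence.
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


# Hypothetical toy peptide, for illustration only.
composition = amino_acid_composition("ACDAAC")
```

For the toy peptide, the entry for A is 0.5 because three of its six residues are alanine, and the 20 values always sum to 1.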
Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds, and their interactions

D.-S. Cao*, Nan Xiao*, Q.-S. Xu and A. F. Chen. *Joint first authors.

Rcpi schematic

Abstract: In the chemoinformatics and bioinformatics fields, one of the main computational challenges in predictive modeling is to find a suitable way to effectively represent the molecules under investigation, such as small molecules, proteins and even complex interactions. To solve this problem, we developed a freely available R/Bioconductor package, called Compound-Protein Interaction with R (Rcpi), for complex molecular representation from drugs, proteins and more complex interactions including protein-protein and compound-protein interactions. Rcpi could calculate a large number of structural and physicochemical features of proteins and peptides from amino acid sequences, molecular descriptors of small molecules from their topology, and protein-protein interaction and compound-protein interaction descriptors. In addition to its main functionalities, Rcpi could also provide a number of useful auxiliary utilities to meet users' needs. With the descriptors calculated by this package, users could conveniently apply various statistical machine learning methods in R to solve various biological and drug research questions in computational biology and drug discovery. Rcpi is freely available from the Bioconductor website.

Bioinformatics (2015).
Integrating Multiple Evidence Sources to Predict Adverse Drug Reactions Based on Systems Pharmacology Model

D.-S. Cao, Nan Xiao, Y.-J. Li, W.-B. Zeng, Y.-Z. Liang, A.-P. Lu, Q.-S. Xu, A. F. Chen.

MEF schematic

Abstract: Identifying potential adverse drug reactions (ADRs) is critically important for drug discovery and public health. Here we developed a multiple evidence fusion (MEF) method for the large-scale prediction of drug ADRs that can handle both approved drugs and novel molecules. MEF is based on the similarity reference by collaborative filtering, and integrates multiple similarity measures from various data types, taking advantage of the complementarity in the data. We used MEF to integrate drug-related and ADR-related data from multiple levels, including the network structural data formed by known drug–ADR relationships for predicting likely unknown ADRs. On cross-validation, it obtains high sensitivity and specificity, substantially outperforming existing methods that utilize single or a few data types. We validated our predictions by their overlap with drug–ADR associations that are known in databases. The proposed computational method could be used for complementary hypothesis generation and rapid analysis of potential drug–ADR interactions.

CPT: Pharmacometrics & Systems Pharmacology (2015).
In silico Evaluation of logD7.4 and Comparison with Other Prediction Methods

J.-B. Wang, D.-S. Cao, M.-F. Zhu, Y.-H. Yun, Nan Xiao, Y.-Z. Liang.

logd corrgram

Abstract: Lipophilicity, evaluated by either n-octanol/water partition coefficient (logP) or n-octanol/buffer solution distribution coefficient (logD), is of high importance in pharmacology, toxicology and medicinal chemistry. A quantitative structure-property relationship (QSPR) study was carried out for the prediction of distribution coefficients at pH 7.4 (logD7.4) of a large data set consisting of 1,130 organic compounds with 30 molecular descriptors selected by genetic algorithm (GA). Partial least squares (PLS) and support vector machine (SVM) regression were used to build prediction models with 904 molecules as the training set, and the predictive ability was evaluated with 226 molecules as the external test set. The results exhibited by the regression statistics demonstrate that the SVM model is more reliable and has a better predictive accuracy than the PLS model. The squared correlation coefficients of fitting, cross-validation and prediction are 0.92, 0.90 and 0.90, respectively. The corresponding root mean square errors are 0.52, 0.59 and 0.56, respectively. The reliability and generalization ability of the model were assessed by applicability domain and Y-randomization test, and the 30 selected molecular descriptors could give a reliable and direct interpretation to logD7.4 to some extent. When compared with the logD7.4 values calculated by five methods from Discovery Studio and ChemAxon, our SVM model shows superiority over them. The results indicate that our model is a reliable and promising method to evaluate logD7.4.

Journal of Chemometrics (2015).
© Nan Xiao 2017