Data Science Practitioner
Machine Learning Researcher
Abstract: Regression and variable selection in high-dimensional settings, especially when p >> n has been a popular research topic in statistical machine learning. In recent years, many successful methods have been developed to tackle this problem. In this paper, we propose the multi-step adaptive elastic-net (MSA-Enet), a multi-step estimation algorithm built upon adaptive elastic-net regularization. The numerical study on simulation data and real-world biological datasets have shown that the MSA-Enet method tends to significantly reduce the number of false-positive variables, while still maintain the estimation accuracy. By analyzing the variables eliminated in each step, more insight could be gained about the structure of the correlated variable groups. These properties are desirable in many real-world variable selection and regression problems.
Abstract: Amino acid sequence-derived structural and physiochemical descriptors are extensively utilized for the research of structural, functional, expression and interaction profiles of proteins and peptides. We developed protr, a comprehensive R package for generating various numerical representation schemes of proteins and peptides from amino acid sequence. The package calculates eight descriptor groups composed of twenty two types of commonly used descriptors that include about 22,700 descriptor values. It allows users to select amino acid properties from the AAindex database, and use self-defined properties to construct customized descriptors. For proteochemometric modeling, it calculates six types of scales-based descriptors derived by various dimensionality reduction methods. The protr package also integrates the functionality of similarity score computation derived by protein sequence alignment and Gene Ontology (GO) semantic similarity measures within a list of proteins, and calculates profile-based protein features based on position-specific scoring matrix (PSSM). We also developed ProtrWeb, a user-friendly web server for calculating descriptors presented in the protr package. The protr package is freely available from CRAN (link). ProtrWeb is freely available at this link.
Abstract: In chemoinformatics and bioinformatics fields, one of the main computational challenges in various predictive modeling is to find a suitable way to effectively represent the molecules under investigation such as small molecules, proteins and even complex interactions. To solve this problem, we developed a freely available R/Bioconductor package, called Compound-Protein Interaction with R (Rcpi), for complex molecular representation from drugs, proteins and more complex interactions including protein-protein and compound-protein interactions. Rcpi could calculate a large number of structural and physicochemical features of proteins and peptides from amino acid sequences, molecular descriptors of small molecules from their topology, and protein-protein interaction and compound-protein interaction descriptors. In addition to main functionalities, Rcpi could also provide a number of useful auxiliary utilities to facilitate the user's need. With the descriptors calculated by this package, the users could conveniently apply various statistical machine learning methods in R to solve various biological and drug research questions in computational biology and drug discovery. Rcpi is freely available from the Bioconductor website (link).
Abstract: Identifying potential adverse drug reactions (ADRs) is critically important for drug discovery and public health. Here we developed a multiple evidence fusion (MEF) method for the large-scale prediction of drug ADRs that can handle both approved drugs and novel molecules. MEF is based on the similarity reference by collaborative filtering, and integrates multiple similarity measures from various data types, taking advantage of the complementarity in the data. We used MEF to integrate drug-related and ADR-related data from multiple levels, including the network structural data formed by known drug–ADR relationships for predicting likely unknown ADRs. On cross-validation, it obtains high sensitivity and specificity, substantially outperforming existing methods that utilize single or a few data types. We validated our prediction by their overlap with drug–ADR associations that are known in databases. The proposed computational method could be used for complementary hypothesis generation and rapid analysis of potential drug–ADR interactions.
Abstract: In high-dimensional data modeling, variable selection methods have been a popular choice to improve the prediction accuracy by effectively selecting the subset of informative variables, and such methods can enhance the model interpretability with sparse representation. In this study, we propose a novel group variable selection method named ordered homogeneity pursuit lasso (OHPL) that takes the homogeneity structure in high-dimensional data into account. OHPL is particularly useful in high-dimensional datasets with strongly correlated variables. We illustrate the approach using three real-world spectroscopic datasets and compare it with four state-of-the-art variable selection methods. The benchmark results on real-world data show that the proposed method is capable of identifying a small number of influential groups and has better prediction performance than its competitors. The OHPL method and the spectroscopic datasets are implemented and included in an R package OHPL available from https://OHPL.io.
J.-B. Wang, D.-S. Cao, M.-F. Zhu, Y.-H. Yun,
Abstract: Lipophilicity, evaluated by either n-octanol/water partition coefficient (logP) or n-octanol/buffer solution distribution coefficient (logD), is of high importance in pharmacology, toxicology and medicinal chemistry. A quantitative structure-property relationship (QSPR) study was carried out for the prediction of distribution coefficients at pH 7.4 (logD7.4) of a large data set consisting of 1,130 organic compounds with 30 molecular descriptors selected by genetic algorithm (GA). Partial least squares (PLS) and support vector machine (SVM) regression were used to build prediction models with 904 molecules as the training set, and the predictive ability was evaluated with 226 molecules as the external test set. The results exhibited by the regression statistics demonstrate that the SVM model is more reliable and has a better predictive accuracy than the PLS model. The square correlation coefficients of fitting, cross validation and prediction are 0.92, 0.90 and 0.90, respectively. The corresponding root mean square errors are 0.52, 0.59 and 0.56, respectively. The reliability and generalization ability of the model were assessed by applicability domain and Y-randomization test, and 30 selected molecular descriptors could give a reliable and direct interpretation to logD7.4 to some extent. When compared with the logD7.4 values calculated by five methods from Discovery Studio and ChemAxon, our SVM model shows superiority over them. The results indicate that our model is a reliable and promising method to evaluate logD7.4.
L. Shen, D.-S. Cao, Q.-S. Xu, X. Huang,
Summary: We developed hdnom, an R package for survival modeling with high-dimensional data. The package is the first free and open-source software package that streamlines the workflow of penalized Cox model building, validation, calibration, comparison, and nomogram visualization, with nine types of penalized Cox regression methods fully supported. A web application and an online prediction tool maker are offered to enhance interactivity and flexibility in high-dimensional survival analysis.
Availability: The hdnom R package is available from CRAN: https://cran.r-project.org/package=hdnom under GPL. The hdnom web application can be accessed at http://hdnom.io. The web application maker is available from https://hdnom.org/appmaker. The hdnom project website: https://hdnom.org.
Abstract: We have accumulated numerous excellent software packages for analyzing large-scale biomedical data on the way to delivering on the promise of human genomics. Bioconductor workflows illustrated the feasibility of organizing and demonstrating such software collections in a reproducible and human-readable way. Going forward, how to implement fully automatic workflow execution and persistently reproducible report compilation on an industrial-scale becomes challenging from the engineering perspective. For example, the software tools across workflows usually require drastically different system dependencies and execution environments and thus need to be isolated completely. As one of the first efforts exploring the possibility of bioinformatics workflow containerization and orchestration using Docker, the DockFlow project aims to containerize every single existing Bioconductor workflow in a clean, smooth, and scalable way. We will show that with the help of our R package liftr, it is possible to achieve the goal of persistent reproducible workflow containerization by simply creating and managing a YAML configuration file for each workflow. We will also share our experience and the pitfalls encountered during such containerization efforts, which may offer some best practices and valuable references for creating similar bioinformatics workflows in the future. The DockFlow project website: https://dockflow.org.
Poster of the final project for the class HGEN 48600/STAT 35450 (Fundamentals of Computational Biology: Models and Inference) at The University of Chicago in 2016. We explored the hypothesis if there is color topics used in the visual design of movie posters with the generative model STRUCTURE, or namely, Latent Dirichlet Allocation.
Abstract: liftr extends the R Markdown metadata format. It helps you generate Dockerfile for rendering R Markdown documents in Docker containers. Users can also include and run pre-defined Rabix tools/workflows, then analyze the Rabix output in the dockerized R Markdown documents.