Nan Xiao

R developer
data science practitioner
machine learning researcher


The Deep Connection between Drugs and Side Effects

Nan Xiao


ISCB Art in Science Competition. ISMB/ECCB 2017, Prague, Czech Republic.
DockFlow: Bioconductor Workflow Containerization and Orchestration with liftr

Nan Xiao, Tengfei Yin, and Miaozhu Li.


Abstract: We have accumulated numerous excellent software packages for analyzing large-scale biomedical data on the way to delivering on the promise of human genomics. Bioconductor workflows have illustrated the feasibility of organizing and demonstrating such software collections in a reproducible and human-readable way. Going forward, implementing fully automatic workflow execution and persistently reproducible report compilation on an industrial scale becomes challenging from an engineering perspective. For example, the software tools across workflows usually require drastically different system dependencies and execution environments, and thus need to be isolated completely. As one of the first efforts exploring the possibility of bioinformatics workflow containerization and orchestration using Docker, the DockFlow project aims to containerize every existing Bioconductor workflow in a clean, smooth, and scalable way. We will show that, with the help of our R package liftr, persistent reproducible workflow containerization can be achieved by simply creating and managing a YAML configuration file for each workflow. We will also share our experience and the pitfalls encountered during such containerization efforts, which may offer some best practices and valuable references for creating similar bioinformatics workflows in the future. The DockFlow project website:

BioC 2017, July 26-28, Dana-Farber Cancer Institute, Boston, MA.
ChromaClust: Latent Color Topic Modeling for Images

Nan Xiao


Poster for the final project of the class HGEN 48600/STAT 35450 (Fundamentals of Computational Biology: Models and Inference) at The University of Chicago in 2016. We explored the hypothesis that color topics are used in the visual design of movie posters, using the generative model STRUCTURE, also known as Latent Dirichlet Allocation.

March 8, 2016. Gordon Center for Integrative Sciences, The University of Chicago.
liftr: Reproducible Bioinformatics and Statistical Data Analysis with Docker, Rabix, and knitr

Nan Xiao, Tengfei Yin, and Miaozhu Li.


Abstract: liftr extends the R Markdown metadata format. It helps you generate a Dockerfile for rendering R Markdown documents in Docker containers. Users can also include and run pre-defined Rabix tools and workflows, then analyze the Rabix output in the dockerized R Markdown documents.

BioC 2015, July 20-22, Fred Hutchinson Cancer Research Center, Seattle, WA.


Persistent Reproducible Reporting with Docker and R

Invited talk at the 10th China R Conference. Tsinghua University, Beijing, China. May 20, 2017.

Abstract: Automatic report generation has a massive number of use cases in reproducible research and commercial applications. Fortunately, most of the problems involved have been elegantly solved for the R community by knitr and the R Markdown specification. However, data persistence and operating system-level reproducibility have rarely been considered in the context of reproducible report generation, and they have become a major concern in current software implementations. In this talk, we will discuss potential approaches to these problems, particularly with the help of modern containerization technologies. We will also demonstrate how to compose a persistent and reproducible R Markdown report with the help of the two R packages we developed: docker-r and liftr. Specifically, you will learn how to dockerize your existing R Markdown documents, how to apply this approach to the analysis of petabyte-scale cancer genomics data on the Cancer Genomics Cloud, and how to distribute or reuse such containerized reports.
Reproducible Dynamic Report Generation with Docker and R

Invited talk at DockerCon 2017. Austin, TX. April, 2017.

Abstract: Automatic report generation is widely needed in reproducible research and commercial applications. However, operating system-level reproducibility is still a major concern in current implementations. I will demonstrate how easy it is to write a dynamic and reproducible report with the help of Docker, the Docker API R client package, and liftr, the R package we developed. Specifically, you will see how to dockerize your existing R Markdown documents, with applications to the analysis of petabyte-scale cancer genomics data, and the potential to distribute and reuse such reports.
Cancer Genomics Cloud & R: Find, Access, and Analyze Petabyte-Scale Cancer Genomic Data on the Cloud

Invited talk at Boston R/Bioconductor for Genomics Meetup. Dana-Farber Cancer Institute. January 12, 2017.

An introduction to the Cancer Genomics Cloud and the R client package sevenbridges. Boston Bioconductor Meetup in January 2017.

High-Dimensional Survival Modeling with Shiny

Invited talk at Shiny Developers Conference. Stanford University. January 30, 2016.

Abstract: In this talk, we will demonstrate a Shiny application for high-dimensional survival modeling. The application supports automatic model building, validation, calibration, comparison, Kaplan-Meier analysis of risk groups, and reproducible report generation. With the application, physicians and clinical researchers can build prognostic models, validate model performance, and prepare publication-quality figures easily, in a fully reproducible way.
Introduction to Reproducible Research in Bioinformatics

Invited talk at 2015 Bioinformatics Workshop. Center for Research Informatics, The University of Chicago. December 3, 2015.

Abstract: The workshop introduced the modern concepts, principles, tools, and challenges in reproducible (computational) research, covering the following topics:

  1. Workflow automation (GNU make & workflow systems);
  2. R & Python packages;
  3. knitr & IPython Notebook;
  4. Version control system: git & GitHub;
  5. Package dependency management: packrat & virtualenv;
  6. System dependency management: Docker & liftr.
liftr & sbgr kickstart

Invited workshop (joint with Dan Tenenbaum & Tengfei Yin) at BioC 2015. Fred Hutchinson Cancer Research Center, Seattle, WA. July 21, 2015.

Abstract: We will introduce the Common Workflow Language (CWL), the R package cwl, and the implementation with Rabix. We will then demonstrate how to write an R command-line tool with docopt, how to convert your R command-line tool to CWL, how to use the rabix R package's R interface to describe your tool, and how to use Rabix to develop, deploy, and run it on the AWS cloud with the SBG platform or run it locally. We will also demonstrate dockerizing R Markdown documents with Rabix support using the liftr package, and automating a workflow from raw data upload through pipeline execution to report retrieval with the sbgr API package.

Supervised Distance Metric Learning: A Retrospective

Presented at the Computational Biology & Drug Design (CBDD) Group, Central South University. December, 2013.

Abstract: Appropriate methods for measuring the similarity between data points are urgently needed in machine learning research, but handcrafting good metrics for specific problems is difficult. Over the past decade, this has led to the emergence of supervised distance metric learning, which aims to learn a metric from data automatically. The talk reviews the successful methods in the field of supervised distance metric learning and discusses the pros and cons of each approach, especially RCA, NCA, ITML, and LMNN.

Keywords: distance metric learning

Web Scraping with R

Invited talk at the 6th China R Conference. Renmin University of China, Beijing. May 18, 2013.

Abstract: The web itself is the world's largest publicly accessible data source. Knowing how to scrape data from the web has become a must-have skill, particularly for data hackers. In this talk, you will learn the basic coding strategies and neat tricks for web scraping with R. While introducing how to retrieve data from the web and parse a variety of data formats, we will summarize the usage and application scenarios of several useful R packages. Last but not least, this talk emphasizes suitable exception handling and parallelization methods, which are crucial for building a robust and high-performance web scraper with R.

Keywords: R; web scraping; web crawling

Linear and Circular Layouts for Network Visualization

Presented at Computational Biology & Drug Design (CBDD) Group, Central South University. March 29, 2012.

Introduction to the linear and circular layouts for network visualization.

Visualization of CRAN Package Dependency Network

Presented at 2010 PKU Visualization Summer School. Peking University, Beijing. August 18, 2010.

Final project presentation by our group for the 2010 Visualization Summer School at Peking University.

© Nan Xiao 2017