Nan Xiao

R Developer
Data Science Practitioner
Machine Learning Researcher

Persistent Reproducible Reporting with Docker and R

Invited talk at the 10th China R Conference. Tsinghua University, Beijing, China. May 20, 2017.

Abstract: Automatic report generation has a massive number of use cases in reproducible research and commercial applications. Fortunately, most of the problems involved in this topic have been elegantly solved by knitr and the R Markdown specification for the R community. However, issues of data persistence and operating system-level reproducibility have rarely been considered in the context of reproducible report generation. Today, such issues have become a major concern in current software implementations. In this talk, we will discuss potential approaches to tackling such problems, particularly with the help of modern containerization technologies. We will also demonstrate how to compose a persistent and reproducible R Markdown report with the help of the two R packages we developed: docker-r and liftr. Specifically, you will learn how to dockerize your existing R Markdown documents, how to apply this to the analysis of petabyte-scale cancer genomics data on the Cancer Genomics Cloud, and how to distribute or reuse such containerized reports.

Reproducible Dynamic Report Generation with Docker and R

Invited talk at DockerCon 2017. Austin, TX. April, 2017.

Abstract: Automatic report generation is extensively needed in reproducible research and commercial applications. However, operating system-level reproducibility is still a huge concern in current implementations. I'm going to demonstrate how easy it is to write a dynamic and reproducible report with the help of Docker, a Docker API R client package, and the R package liftr we developed. Specifically, you will see how to dockerize your existing R Markdown documents, with applications to the analysis of petabyte-scale cancer genomics data, and the potential to distribute and reuse such reports.

Cancer Genomics Cloud & R: Find, Access, and Analyze Petabyte-Scale Cancer Genomic Data on the Cloud

Invited talk at Boston R/Bioconductor for Genomics Meetup. Dana-Farber Cancer Institute. January 12, 2017.

An introduction to the Cancer Genomics Cloud and the R client package sevenbridges.

High-Dimensional Survival Modeling with Shiny

Invited talk at Shiny Developers Conference. Stanford University. January 30, 2016.

Abstract: In this talk, we will demonstrate a Shiny application for high-dimensional survival modeling. The application supports automatic model building, validation, calibration, comparison, Kaplan-Meier analysis of risk groups, and reproducible report generation. With the application, physicians and clinical researchers can easily build prognostic models, validate model performance, and prepare publication-quality figures in a fully reproducible way.

Introduction to Reproducible Research in Bioinformatics

Invited talk at 2015 Bioinformatics Workshop. Center for Research Informatics, The University of Chicago. December 3, 2015.

Abstract: At the workshop, we introduced modern concepts, principles, tools, and challenges in reproducible (computational) research, covering the following topics:

  1. Workflow automation (GNU make & workflow systems);
  2. R & Python packages;
  3. knitr & IPython Notebook;
  4. Version control system: git & GitHub;
  5. Package dependency management: packrat & virtualenv;
  6. System dependency management: Docker & liftr.

liftr & sbgr kickstart

Invited workshop (joint with Dan Tenenbaum & Tengfei Yin) at BioC 2015. Fred Hutchinson Cancer Research Center, Seattle, WA. July 21, 2015.

Abstract: We will introduce the Common Workflow Language (CWL), the R package cwl, and the implementation with Rabix. We will then demonstrate how to write an R command-line tool with docopt, how to convert your R command-line tool to CWL, how to use the rabix R package's R interface to describe your tool, and how to use Rabix to develop, deploy, and run it on the AWS cloud with the SBG platform or run it locally. We will also demonstrate dockerizing R Markdown documents with Rabix support using the liftr package, and automating a workflow from raw data uploading and pipeline running to report retrieval with the sbgr API package.

Supervised Distance Metric Learning: A Retrospective

Presented at the Computational Biology & Drug Design (CBDD) Group, Central South University. December, 2013.

Abstract: The need for appropriate methods to measure the similarity between data points is urgent in machine learning research, but handcrafting good metrics for specific problems is difficult. Over the past decade, this has led to the emergence of supervised distance metric learning, which aims at automatically learning a metric from data. The talk reviews the successful methods in the field of supervised distance metric learning and discusses the pros and cons of each approach, especially RCA, NCA, ITML, and LMNN.

Keywords: distance metric learning

Web Scraping with R

Invited talk at the 6th China R Conference. Renmin University of China, Beijing. May 18, 2013.

Abstract: The web itself is the world's largest publicly accessible data source. Knowing how to scrape data from the web has become a must-have skill, particularly for data hackers. In this talk, you will learn the basic coding strategies and neat tricks for web scraping with R. While introducing how to retrieve data from the web and parse a variety of data formats, we will summarize the usage and application scenarios of several useful R packages. Last but not least, this talk emphasizes suitable exception handling and parallelization methods, which are crucial for building a robust and high-performance web scraper with R.

Keywords: R; web scraping; web crawling

Linear and Circular Layouts for Network Visualization

Presented at Computational Biology & Drug Design (CBDD) Group, Central South University. March 29, 2012.

Introduction to the linear and circular layouts for network visualization.

Visualization of CRAN Package Dependency Network

Presented at 2010 PKU Visualization Summer School. Peking University, Beijing. August 18, 2010.

Final project presentation of our group for the 2010 Visualization Summer School at Peking University.