Senior Data Scientist. Seven Bridges Genomics, Inc. Cambridge, MA, USA.
PhD Candidate. Statistics. Central South University. Changsha, Hunan, China.
Abstract: We have accumulated numerous excellent software packages for analyzing large-scale biomedical data on the way to delivering on the promise of human genomics. Bioconductor workflows have illustrated the feasibility of organizing and demonstrating such software collections in a reproducible and human-readable way. Going forward, implementing fully automatic workflow execution and persistently reproducible report compilation at an industrial scale becomes challenging from the engineering perspective. For example, the software tools across workflows usually require drastically different system dependencies and execution environments, and thus need to be completely isolated. As one of the first efforts exploring bioinformatics workflow containerization and orchestration with Docker, the DockFlow project aims to containerize every existing Bioconductor workflow in a clean, smooth, and scalable way. We will show that, with the help of our R package liftr, persistent and reproducible workflow containerization can be achieved by simply creating and managing a YAML configuration file for each workflow. We will also share our experience and the pitfalls encountered during these containerization efforts, which may offer best practices and valuable references for creating similar bioinformatics workflows in the future. The DockFlow project website: https://dockflow.org.
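Under this approach, each workflow's container environment is declared in a small YAML file. A hypothetical sketch of such a configuration (the field names are illustrative, not the exact DockFlow schema):

```yaml
# Hypothetical per-workflow configuration (illustrative field names)
liftr:
  from: "bioconductor/release_core2"  # base Docker image for the workflow
  pandoc: true                        # include pandoc for report compilation
  sysdeps:
    - libxml2-dev                     # system libraries the tools depend on
  cran:
    - ggplot2                         # CRAN packages installed into the image
  bioc:
    - limma                           # Bioconductor packages for the workflow
    - edgeR
```

Keeping the environment declaration separate from the workflow document is what makes the containerization scale: each workflow can be rebuilt and re-rendered from scratch on any machine with Docker installed.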
BioC 2017, July 26-28, Dana-Farber Cancer Institute, Boston, MA.
Poster of the final project for the class HGEN 48600/STAT 35450 (Fundamentals of Computational Biology: Models and Inference) at The University of Chicago in 2016. We explored whether there are "color topics" in the visual design of movie posters, using the generative model behind STRUCTURE, also known as latent Dirichlet allocation (LDA).
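Treating each poster as a "document" of quantized color "words", the LDA generative process (in standard notation, not taken from the poster itself) can be written as:

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{color-topic proportions for poster } d \\
z_{dn} &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assignment of the } n\text{-th color token} \\
w_{dn} &\sim \mathrm{Categorical}(\beta_{z_{dn}})
  && \text{observed quantized color drawn from topic } z_{dn}
\end{align*}
```

Inference then recovers the color topics $\beta_k$ and each poster's topic mixture $\theta_d$ from the observed colors alone.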
March 8, 2016. Gordon Center for Integrative Sciences, University of Chicago.
Abstract: liftr extends the R Markdown metadata format. It helps you generate a Dockerfile for rendering R Markdown documents in Docker containers. Users can also include and run pre-defined Rabix tools/workflows, then analyze the Rabix output in the dockerized R Markdown documents.
BioC 2015, July 20-22, Fred Hutchinson Cancer Research Center, Seattle, WA.
Abstract: The R package liftr aims to solve the problem of persistent reproducible reporting in statistical computing. It is one of the winners of the 2018 John M. Chambers Statistical Software Award. The R Markdown format and its backend compilation engine knitr offer a de facto standard for creating dynamic documents. However, the reproducibility of such computing environments is often limited to individual machines: it is not easy to replicate the system environment (libraries, R versions, R packages) where the document was compiled. By introducing Docker, the open source containerization technology, liftr solves this reproducibility problem. With the help of liftr, R Markdown users can quickly create and manage Docker containers for rendering their documents, thus making the computations fully reproducible across machines and systems. liftr redefined the meaning of reproducible research by offering system-level reproducibility for data analysis for the first time, and made it easier to create large-scale dynamic document building services. We will discuss the design philosophy, implementation, and applications of the liftr package. (PDF Slides)
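In practice, the container environment is declared in the R Markdown document's YAML front matter. A minimal sketch (the field names follow liftr's metadata format, though exact options may vary by package version, and the maintainer name is a placeholder):

```yaml
---
title: "Example Analysis"
output: html_document
liftr:
  from: "rocker/r-base"    # base Docker image to build on
  maintainer: "Jane Doe"   # placeholder maintainer name
  pandoc: true             # include pandoc in the image
  cran:
    - ggplot2              # CRAN packages installed into the image
---
```

One would then call `liftr::lift()` on the document to generate the Dockerfile, and render the document inside the container (e.g., with `liftr::render_docker()` in recent versions), so the compiled report never depends on the host machine's R installation.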
Invited talk at the 10th China R Conference. Tsinghua University, Beijing, China. May 20, 2017.
Abstract: Automatic report generation has a massive number of use cases in reproducible research and commercial applications. Fortunately, most of the problems involved in this topic have been elegantly solved by knitr and the R Markdown specification for the R community. However, the issues of data persistence and operating system-level reproducibility were rarely considered in the context of reproducible report generation, and they have become a major concern in current software implementations. In this talk, we will discuss potential approaches to tackling such problems, particularly with the help of modern containerization technologies. We will also demonstrate how to compose a persistent and reproducible R Markdown report with the help of the two R packages we developed: docker-r and liftr. Specifically, you will learn how to dockerize your existing R Markdown documents, how to apply this to the analysis of petabyte-scale cancer genomics data on the Cancer Genomics Cloud, and how to distribute or reuse such containerized reports. (PDF Slides)
Invited talk at DockerCon 2017. Austin, TX. April, 2017.
Abstract: Automatic report generation is extensively needed in reproducible research and commercial applications. However, operating system-level reproducibility is still a huge concern in current implementations. I will demonstrate how easy it is to write a dynamic and reproducible report with the help of Docker, the Docker API R client package, and the R package liftr we developed. Specifically, you will see how to dockerize your existing R Markdown documents, with applications to the analysis of petabyte-scale cancer genomics data, and the potential to distribute and reuse such reports. (PDF Slides)
Invited talk at Boston R/Bioconductor for Genomics Meetup. Dana-Farber Cancer Institute. January 12, 2017.
A brief introduction to the Cancer Genomics Cloud and the R API client package sevenbridges-r.
Invited talk at the Shiny Developer Conference (the precursor of rstudio::conf). Stanford University. January 30, 2016.
Abstract: In the talk, we will demonstrate hdnom.io, a Shiny application for high-dimensional survival modeling. The application supports automatic model building, validation, calibration, comparison, Kaplan-Meier analysis of risk groups, and reproducible report generation. With the application, physicians and clinical researchers can build prognostic models, validate model performance, and prepare publication-quality figures easily, in a fully reproducible way. (PDF Slides, Video)
Invited talk at 2015 Bioinformatics Workshop. Center for Research Informatics, University of Chicago. December 3, 2015.
Abstract: We introduced the modern concepts, principles, tools, and challenges in reproducible (computational) research at the workshop, covering a range of related topics.
Invited workshop (joint with Dan Tenenbaum & Tengfei Yin) at BioC 2015. Fred Hutchinson Cancer Research Center, Seattle, WA. July 21, 2015.
Abstract: We will introduce the Common Workflow Language (CWL), the R package cwl, and its implementation with Rabix. We will then demonstrate how to write an R command-line tool with docopt, how to convert your R command-line tool to CWL, how to describe your tool with the rabix R package's interface, and how to use Rabix to develop, deploy, and run it on the AWS cloud via the Seven Bridges platform, or run it locally. We will also demonstrate dockerizing R Markdown documents with Rabix support using the liftr package, and automating a workflow, from raw data uploading and pipeline running to report retrieving, with the sbgr API package. (PDF Slides)
Informal talk for the Computational Biology & Drug Design (CBDD) Group, Central South University. December, 2013.
Abstract: Appropriate methods for measuring the similarity between data points are urgently needed in machine learning research, but handcrafting good metrics for specific problems is difficult. This has led to the emergence of supervised distance metric learning over the past decade, which aims to learn a metric from data automatically. The talk reviews the successful methods in the field of supervised distance metric learning and discusses the pros and cons of each approach, especially RCA, NCA, ITML, and LMNN. (PDF Slides)
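Most of the methods surveyed (RCA, ITML, LMNN) learn a Mahalanobis metric. In standard notation (not taken from the slides), the learned distance is

```latex
d_M(\mathbf{x}_i, \mathbf{x}_j)
  = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^\top M \, (\mathbf{x}_i - \mathbf{x}_j)},
\qquad M \succeq 0,
```

so that writing $M = L^\top L$ shows the learned metric is simply the Euclidean distance after the linear map $\mathbf{x} \mapsto L\mathbf{x}$; the methods differ mainly in the objective and constraints used to fit $M$ from labeled or side information.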
Invited talk at the 6th China R Conference. Renmin University of China, Beijing. May 18, 2013.
Abstract: The web itself is the world's largest publicly accessible data source. Knowing how to scrape data from the web has become a must-have skill, particularly for data hackers. In this talk, you will learn basic coding strategies and neat tricks for web scraping with R. While introducing how to retrieve data from the web and parse a variety of data formats, we will summarize the usage and application scenarios of several useful R packages. Last but not least, this talk emphasizes suitable exception handling and parallelization methods, which are crucial for building a robust and high-performance web scraper with R. (PDF Slides)
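The exception handling and parallelization points can be sketched with base R alone (a minimal illustration, not code from the talk; the URLs are placeholders):

```r
# Minimal sketch: fetch pages robustly and in parallel with base R.
library(parallel)

fetch_page <- function(url) {
  # tryCatch() keeps one failed request from aborting the whole run
  tryCatch(
    paste(readLines(url, warn = FALSE), collapse = "\n"),
    error   = function(e) NA_character_,
    warning = function(w) NA_character_
  )
}

urls  <- c("https://example.com/", "https://example.com/missing-page")
# mclapply() fetches the URLs on multiple cores (forking; Unix-like systems)
pages <- mclapply(urls, fetch_page, mc.cores = 2)
```

Failed requests come back as `NA` instead of errors, so the scraper degrades gracefully and the results can be filtered and retried afterwards.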
Informal talk at the Computational Biology & Drug Design (CBDD) Group, Central South University. March 29, 2012.
A brief introduction to the linear and circular layouts for network visualization. (PDF Slides)