Abstract: We have accumulated numerous excellent software packages for analyzing large-scale biomedical data on the way to delivering on the promise of human genomics. Bioconductor workflows illustrated the feasibility of organizing and demonstrating such software collections in a reproducible and human-readable way. Going forward, implementing fully automatic workflow execution and persistently reproducible report compilation at an industrial scale becomes an engineering challenge. For example, the software tools across workflows usually require drastically different system dependencies and execution environments and thus need to be isolated completely. As one of the first efforts exploring the possibility of bioinformatics workflow containerization and orchestration using Docker, the DockFlow project aims to containerize every single existing Bioconductor workflow in a clean, smooth, and scalable way. We will show that with the help of our R package liftr, it is possible to achieve the goal of persistently reproducible workflow containerization by simply creating and managing a YAML configuration file for each workflow. We will also share our experience and the pitfalls encountered during such containerization efforts, which may offer some best practices and valuable references for creating similar bioinformatics workflows in the future. The DockFlow project website: https://dockflow.org.
Poster of the final project for the class HGEN 48600/STAT 35450 (Fundamentals of Computational Biology: Models and Inference) at The University of Chicago in 2016. We explored the hypothesis that color topics are used in the visual design of movie posters, using the generative model STRUCTURE, namely Latent Dirichlet Allocation.
Abstract: liftr extends the R Markdown metadata format. It helps you generate a Dockerfile for rendering R Markdown documents in Docker containers. Users can also include and run pre-defined Rabix tools/workflows, then analyze the Rabix output in the dockerized R Markdown documents.
Invited talk at the 10th China R Conference. Tsinghua University, Beijing, China. May 20, 2017.
Invited talk at DockerCon 2017. Austin, TX. April, 2017.
Invited talk at Boston R/Bioconductor for Genomics Meetup. Dana-Farber Cancer Institute. January 12, 2017.
Invited talk at Shiny Developers Conference. Stanford University. January 30, 2016.
Invited talk at 2015 Bioinformatics Workshop. Center for Research Informatics, The University of Chicago. December 3, 2015.
Abstract: We introduced the modern concepts, principles, tools, and challenges in reproducible (computational) research at the workshop, with some coverage of the following topics:
Invited workshop (joint with Dan Tenenbaum & Tengfei Yin) at BioC 2015. Fred Hutchinson Cancer Research Center, Seattle, WA. July 21, 2015.
Abstract: We will introduce the Common Workflow Language (CWL) and the R package cwl, the implementation with Rabix, and then give a demo on how to write an R command line tool with docopt, how to convert your R command line tool to CWL, how to use the rabix R package's R interface to describe your tool, and how to use Rabix to develop, deploy, and run it on the AWS cloud with the SBG platform or run it locally. We will also demonstrate dockerizing R Markdown documents with Rabix support using the liftr package, and automating a workflow from raw data uploading, through pipeline running, to report retrieving with the sbgr API package.
Presented at the Computational Biology & Drug Design (CBDD) Group, Central South University. December, 2013.
Abstract: Machine learning research urgently needs appropriate methods to measure the similarity between data points, but handcrafting good metrics for specific problems is difficult. Over the past decade, this has led to the emergence of supervised distance metric learning, which aims to learn a metric from data automatically. The talk reviews the successful methods in the field of supervised distance metric learning and discusses the pros and cons of each approach, especially RCA, NCA, ITML, and LMNN.
Keywords: distance metric learning
Invited talk at the 6th China R Conference. Renmin University of China, Beijing. May 18, 2013.
Abstract: The web itself is the world's largest publicly accessible data source. Knowing how to scrape data from the web has become a must-have skill, particularly for data hackers. In this report, you will learn the basic coding strategies and neat tricks for web scraping with R. While introducing how to retrieve data from the web and parse a variety of data formats, we will summarize the usage and application scenarios of several useful R packages. Last but not least, this report emphasizes suitable exception handling and parallelization methods, which are crucial for building a robust and high-performance web scraper with R.
Keywords: R; web scraping; web crawling
Presented at Computational Biology & Drug Design (CBDD) Group, Central South University. March 29, 2012.
Introduction to the linear and circular layouts for network visualization.
Presented at 2010 PKU Visualization Summer School. Peking University, Beijing. August 18, 2010.
Final project presentation of our group for the 2010 visualization summer school at Peking University.