Why Everyone Should #DeletePython

Nan Xiao · 2018-04-01

Disclaimer: this article reflects my personal opinion only. Reader discretion is advised.

How BMO changes batteries (Adventure Time). Cute but dangerous, just like working with Python 2 and 3 simultaneously.

OK, there, I said it. Everyone should consider #DeletePython.

Recently, people have raised significant concerns about Facebook’s privacy practices regarding their users’ data, with discussions on Twitter under the hashtag #DeleteFacebook. This inspired me to say something about another spicy topic: #DeletePython.

Well, I have to admit that “everyone” might be a bit of an exaggeration. The more accurate central message would be: everyone who wants to do decent, sustainable, modern statistics should consider R, not Python, as their primary language.

I know this may offend some Python lovers, but I do have my reasons, explained below.

Some context

For almost ten years, I have been doing almost everything that involves programming in R, without major bottlenecks. What I have done includes frequentist/Bayesian statistics, (statistical) machine learning, recommender systems, bioinformatics, cheminformatics, data visualization, creating RESTful APIs, and building web applications. That should give you an idea of how few limits there are on what you can achieve with R.

As the challenger, I believe Python is a solid general-purpose programming language that can also help people do these things well. Notably, for “data science” related tasks, many people from computer science backgrounds probably see Python as a nice, free replacement for the good old MATLAB (may it rest in peace) for doing machine learning. People coming from pure software engineering backgrounds might also want to try some statistical modeling in a language they are already familiar with.

However, all things considered, I have to point out that Python also has seemingly trivial but obvious issues that block me from contributing to its ecosystem, or even from considering it a serious language, at least for “data science” purposes.

1. A tale of two versions

Every time I think about using a specific piece of code written in Python, the first question that pops into my head is not about performance but this: which Python version should I use, 2 or 3? As most people know, Python has long had this ridiculous problem of two parallel versions/ecosystems with significant incompatibilities.

In some Linux distributions, python means Python 3 and python2 means Python 2, while in other distributions python means Python 2 and you need python3 to call Python 3. Sometimes it is even more complex: with Homebrew Python, for example, you will essentially be dealing with python, python2, and python3, along with pip, pip2, and pip3.

Recently, there have been genuine improvements as the adoption of Python 3 progresses, but this can still be a pain point when using legacy code written in 2 together with more recent modules written in 3. In R, by contrast, there is one and only one latest stable version, and installing that version is always the recommendation.
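
For contrast, here is a minimal sketch of all the version bookkeeping R requires; there is no r2 versus r3 to juggle (the exact output will of course depend on your installation):

```r
# One language, one current release: a single built-in string tells you
# everything about the interpreter you are running.
R.version.string
#> [1] "R version 3.4.4 (2018-03-15)"
```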

2. Scattered package management

One side effect of the two interpreter versions is the chaotic version/package/environment management. Tools like pyenv, virtualenv, and conda were created to alleviate such issues, but I think they eventually made things worse for me. Most of the time we only need the latest working interpreter with the latest packages to run the code; it is that simple. We do not want to learn someone else’s configurations or fire up a completely different package manager just to reuse other people’s code.
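
For comparison, a sketch of the everyday R workflow I have in mind, using nothing but base utilities (ggplot2 here is just a stand-in for any CRAN package):

```r
install.packages("ggplot2")   # install the latest release from CRAN
update.packages(ask = FALSE)  # bring every installed package up to date
library(ggplot2)              # load it and get to work
```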

The version incompatibility and package management issues will almost surely create technical, even political, problems within large organizations. As many statistical procedures were designed to minimize unnecessary communication and reduce error-prone human factors, such issues should be considered harmful to statistical practice.

3. Shallow copy for mutable objects

One of the most common operations when you code with data is copying one object to a new object, then modifying the new one. To do this in R, we merely need b = a. When we modify b, a will not be affected. Life is beautiful.
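
A minimal sketch of this behavior:

```r
a <- c(1, 2, 3)
b <- a        # conceptually a full copy (R defers the actual copy until needed)
b[1] <- 99    # modifying b triggers copy-on-modify
a             # [1]  1  2  3  -- the original is untouched
b             # [1] 99  2  3
```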

In contrast, Python’s plain assignment does not copy at all: b = a merely binds a new name to the same object, so modifying a mutable object through b also changes what you see through a. Every time, you need to remember a special incantation to get an actual copy, for example b = a[:] (a shallow copy of a list) or b = copy.deepcopy(a). If this is not done correctly, you can end up with unexpected bugs in your code. Believe me, when we say copy something, we mean to copy it, and we surely do not want modifications to the new object to affect the original one, especially if we do this every day.

On a related note, R’s built-in copy-on-modify semantics, a hallmark of its functional design, make it much, much safer to modify values and improve the maintainability of the code. As long as you are not too careless and do not avoid vectorized code entirely, there will not be significant performance issues. In case you do want to optimize, you will be rewarded for delicate mathematical and vectorized thinking, which is pretty cool. Eventually, such functional designs save human time, the more significant bottleneck in the long run.
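
As an illustration of that reward, the explicit loop and the vectorized one-liner below compute exactly the same result, but the second is both faster and easier to read:

```r
x <- rnorm(1e6)

# Element-wise loop: verbose, and slow in R
y1 <- numeric(length(x))
for (i in seq_along(x)) y1[i] <- x[i]^2 + 1

# Vectorized: one expression, no index bookkeeping
y2 <- x^2 + 1

identical(y1, y2)  # TRUE
```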

4. No native data structures for statistical thinking

Python does not have built-in data structures suitable for doing statistics. You will almost surely need third-party data structures such as the NumPy array or the pandas DataFrame. Setting aside their interoperability and performance issues, once you use them, other people need the same libraries (and sometimes the same versions…) just to reuse and understand your code. That is not particularly convenient.

In contrast, I have trusted the vanilla data frame in R and its robust base functions for a long time, not to mention the powerful tidyverse extensions (dplyr, pipes). In fact, the abstraction of vector, matrix, data frame, and list is brilliant. These data structures are designed and forged in such a way that they become a natural part of the solution when people think about and solve statistical problems, without much thought about how exactly the data needs to be stored or computed on.
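
As a small illustration, a sketch of base R operating on a vanilla data frame, with no external libraries involved (the toy data here is made up):

```r
df <- data.frame(x = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))
subset(df, x > 1)                        # row filtering, built in
aggregate(x ~ g, data = df, FUN = mean)  # group means via the formula interface
```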

Beyond that, I also love the vector-oriented design and thinking in R. Everything is a vector: factors are special vectors; matrices and arrays are vectors with dimension attributes; lists are recursive vectors; data frames are special lists, and thus special vectors. After a while, you will realize that treating the vector as the atomic data structure and building everything on top of it fits the paradigm of statistical thinking perfectly.
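
Each of these claims is easy to verify at the console; a minimal sketch:

```r
f <- factor(c("a", "b", "a"))
is.integer(unclass(f))     # TRUE: a factor is an integer vector plus attributes

m <- 1:6
dim(m) <- c(2, 3)          # adding a dim attribute turns the vector into a matrix
is.matrix(m)               # TRUE

df <- data.frame(x = 1:3, y = letters[1:3])
is.list(df)                # TRUE: a data frame is a special list
```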

5. No decent IDEs, ever.

I have personally tried almost every mainstream Python IDE, as well as editors with extensions. The conclusion is sad: none of them is good enough. Maybe my bar is too high, but they all seem to share some common issues, such as lacking essential features for working with data (think of an object inspector). The most critical problem is that most of them simply have bad taste and need better aesthetics in their design.

For R, I do not think there is much need to explain: RStudio is perfect in almost all aspects. It is good even when you compare it with the many IDEs for other languages. Even before RStudio, we had RKWard, a full-featured, beautiful R IDE from the KDE project.

Closing remarks

Thanks to many people’s continuous efforts to improve interoperability, we can now keep working in R even when we need to use some Python code/packages, without having to work with Python directly. For example, the recent R package reticulate allows you to wrap and use Python modules (such as TensorFlow and Keras) from R, and to evaluate Python code in R Markdown documents seamlessly.
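
A minimal sketch of what that looks like, assuming reticulate is installed and a Python with NumPy is available on the system:

```r
library(reticulate)

np <- import("numpy")        # wrap a Python module as an R object
x <- np$array(c(1, 2, 3))    # call Python functions with R syntax
np$mean(x)                   # [1] 2

py_run_string("print('hello from Python')")  # evaluate raw Python code
```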

I firmly believe that the tool you choose will, in turn, influence the way you think. As Prof. Norman Matloff put it, R is written by statisticians, for statisticians. One example of this is the formula interface [1], and such stylish designs are deeply embedded throughout R. Without access to such things, you will probably lose the chance to think about and do statistics in a more elegant way. Thus my point: if you want to do decent, sustainable, and modern statistics, you might want to start with R, not Python.
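
To make the formula interface of [1] concrete, a one-line example using a dataset that ships with R:

```r
# mpg modeled on weight and cylinder count, all specified symbolically
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(fit)
```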

[1] Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic description of factorial models for analysis of variance. Applied Statistics, 22(3), 392–399.