Jonathan Law

Why I Will Always Choose Python over R

There is an ongoing debate in the data science community, and it's often one of the first questions a new data scientist has to ask themselves: should I learn Python or R? These two languages are the powerhouses of data science (though I see Julia eventually taking a bit of that market share), and for good reason. Both are excellent programming languages for the average data scientist, with the ability to run statistical analyses, perform complicated data cleaning, and implement machine learning models. The decision between the two can seem fairly difficult, as there aren't many things that one language can do that the other cannot, and a lot of (virtual) ink has been spilled discussing the pros and cons of each. As someone who has experience with both languages, I'd like to humbly throw my two cents into the debate and tell you why I will always choose Python.

I'll state my bias upfront: I absolutely love Python. It's the language that made me love coding, and writing Python is often as easy as writing English to me. But I also have experience coding in a lot of other languages, and I think that I can apply a neutral lens when talking about the advantages and disadvantages of each.

What's the Difference?

Let's start with R.

R is, first and foremost, a statistical analysis language. As the R website says: "R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques". Outside of data science, R is typically used by academics who need to perform complicated statistical analyses on the data gathered from scientific studies, as well as pollsters looking to mine data leading up to an election. R still has tools for machine learning and even has a version of TensorFlow, but its focus will always remain on statistics and data mining.

Python is a general-purpose scripting language that supports multiple programming paradigms. While R is primarily for statistics and can also be used for general-purpose programming, Python can be used for almost anything, including statistical analyses. It's one of the most pervasive languages, with uses ranging from web scraping to web hosting to machine learning to scientific calculations.

Why I Choose Python

Generality

Python's generality is one of the main reasons that I will always choose Python over R. While it might seem that it would be better to use a specialized language for specialized work, the work of a data scientist usually isn't restricted to pure data analysis. In a single project, I've had to:

  1. Pull data from a SQL database
  2. Perform various data cleaning and formatting
  3. Assemble the data into a network graph (using NetworkX) and perform a network analysis on it
  4. Group the graph into clusters
  5. Scrape a website for textual data
  6. Use NLP on the textual data
  7. Analyze the profiles pulled from the SQL data
  8. Assemble all of the above information into a recommendation system and push the recommendations to a cloud store

I've skipped several steps for the sake of brevity, but you get the picture. Data analysis was a significant part of my work, but without a lot of other moving parts, I wouldn't have been able to access the data, or I wouldn't have it in the form that I needed it in. And while R might have a solution for all of these steps, Python often only requires a few lines to complete the tasks.
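To give a flavor of how little glue code the first few steps can take, here's a minimal sketch. It is not the code from that project: the table, column names, and data are made up for illustration, the database is an in-memory SQLite one, the graph is a plain dictionary rather than NetworkX so that it runs on the standard library alone, and "clusters" is simplified to connected components.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data: an in-memory table of "who is connected to whom".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("Ann", "bob "), ("bob", "cat"), ("dan", "Eve"), ("eve", None)],
)

# Step 1: pull data from the SQL database.
rows = conn.execute("SELECT src, dst FROM follows").fetchall()
conn.close()

# Step 2: clean the data (drop incomplete rows, normalize whitespace and case).
edges = [(s.strip().lower(), d.strip().lower()) for s, d in rows if s and d]

# Step 3: assemble an undirected graph as an adjacency mapping.
graph = defaultdict(set)
for s, d in edges:
    graph[s].add(d)
    graph[d].add(s)


# Step 4: group the graph into clusters (here, connected components).
def connected_components(adjacency):
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


clusters = connected_components(graph)
print(clusters)  # [['ann', 'bob', 'cat'], ['dan', 'eve']]
```

In a real project, a library like NetworkX would replace the hand-rolled graph code, and the SQL connection would point at an actual database rather than an in-memory one.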

Operationality

This speaks to what I call the operationality of Python: moving data in and out of your Python processes is extremely simple, and connecting to broader systems is much easier than doing so in R. This means that you can begin working at a new company with an entirely different IT infrastructure and still be up and running with only a little bit of experimentation. Python is used in finance, healthcare, tech companies like Instagram and Spotify, and many, many others. If you can master Python, you'll have a place at a very large number of companies looking for your skills. This is much less the case for R, which is usually only thrown into data science job postings as a "nice-to-have" or as a second-tier skill.

Python also has strong tie-ins with C and C++, which, with a bit of tweaking, allow for blazing-fast performance on computation-heavy work.

The Tenets of Python

Finally, in terms of ease of use, Python's core tenets and philosophies make writing code in Python a very smooth process. According to "The Zen of Python" by Tim Peters, "There should be one - and preferably only one - obvious way to do it". This means that, when writing Python, everything is extremely regular and repeatable. You can often predict the behavior of a certain line of code using a library you've never used before, simply because the "Zen of Python" makes its behavior fairly reliable. This isn't necessarily the case for all Python packages (that would be impossible, given the open source nature of Python), but in comparison to R, it's fairly standardized.

When I'm writing R (which, I'll admit, I'm not as familiar with), I often find myself searching for the same questions over and over, because there are often many ways to do the same thing, all equally correct. What stuck out to me today, and inspired this blog post, was the simple import statement. In R, both library(dplyr) and library("dplyr") are valid and behave in exactly the same way. Although dplyr without quotes would usually be a variable, R seems to understand it as a string inside the library() call. This is only a simple example, but this variability of behavior pervades R and makes it difficult to work with.
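For contrast, here's a small sketch of how Python draws that line. A bare name in an import statement is always a module name, and importing from a string goes through a different, explicit function, importlib.import_module. The json module is just an arbitrary stand-in for any package.

```python
import importlib

# A bare name in an import statement always refers to a module.
import json

# Importing by string is a separate, explicit function call.
json_by_name = importlib.import_module("json")

# Both routes yield the same module object.
print(json is json_by_name)  # True

# Quoting the name in an import statement is not an alternative spelling;
# it is rejected when the code is compiled.
try:
    compile('import "json"', "<example>", "exec")
    quoted_import_ok = True
except SyntaxError:
    quoted_import_ok = False

print(quoted_import_ok)  # False
```

So Python does have a string-based route, but it looks different from the statement form, and you can tell from the syntax alone whether a name is being treated as an identifier or as text.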

Why You Would Still Use R

Despite all of the arguments above, there are reasons that many people still stick with R over Python. As I said, R is extremely useful for statistical analysis. Creating linear regressions and histograms in R is extremely easy in comparison to Python, and often doesn't even need any external libraries. If you're in academia, where R (along with SPSS and SAS) is a standard tool, or if your job requires quite a bit of statistical analysis and most of the external connections are handled elsewhere, you should stick to R, as it's the superior tool in that realm.

Why You Should Switch To Python (if the above doesn't apply)

Python is ubiquitous. Python is consistent. Python is simple and easy to learn while still being extremely powerful. Python can connect to almost any existing system, manipulate data in various ways, perform statistical and scientific operations at scale, use parallel processing, and handle an enormous breadth of problems and operations. Most people's answer to the choice of Python or R is "It depends on what you're using it for," and while this is true for the most part, Python just has so much more to offer than R, and lacks many of the significant hangups.

Python has its weaknesses, sure. But when I'm given the choice between Python and R, I will choose Python every time.