Which Language Should I Learn?

For many new students and practitioners of data science the question of which programming language to learn can be a source of angst. Learning a programming language well is a time-consuming task and understandably, no one wants to waste their time on the "wrong" language.

At this point, it's probably best to distinguish between people who are formally trained in programming and computer science and people who aren't. Formal training includes on-the-job training and any other structured learning format with a credible, widely accepted validation process. It's certainly possible to teach yourself, and many people do, given all the resources that are available. Taking this route, though, requires some kind of independent validation of what a person knows. Industry certifications are standard tools that anyone can use to demonstrate programming knowledge and proficiency. Some certifications carry more weight than others, so it's important for the aspiring data scientist to do some research before spending what can be a large amount of money on training.

A formally trained or experienced programmer typically knows several of the major languages and perhaps a few obscure ones. Further, one of the main advantages of formal training, particularly in programming languages, is that you gain a fundamental understanding of how languages are structured and therefore have relatively little difficulty learning new ones quickly. This blog entry is primarily for the other set of people: the newbies.

Let's look at a small code sample that introduces some of the topics discussed later in this blog and in our DataSci Portal.  Without peeking ahead, which programming language is this code written in?

public static void procedure()
{
        List<String> list = Arrays.asList("one", "two", "three", "four", "five");
        list.forEach(n -> System.out.println(n));
}

The answer is not particularly important, but there are immediate markers that most experienced programmers would notice. There is also an aspect of this code that might surprise some of those same programmers: it is essentially an iterator, or looping function, written in Java. The method uses a lambda expression, a feature introduced in Java 8 that more closely resembles an older functional language like Lisp than the C family of languages that Java came from. The important takeaway for the new programmer is that this method, as such procedures are called in Java, has an access modifier (public), a static modifier, a type declaration (List), and a return type (void). Given that information, particularly the type declaration, you could have eliminated whole classes of programming languages, such as dynamically typed languages that don't require declared types, even if you weren't exactly sure of what you were looking at.

For reference, let's look at the same loop written in Java and several other languages:

Snippet 1:

public static void method()
{
        List<String> list = Arrays.asList("one", "two", "three", "four","five");
        for (String n: list)
        {
            System.out.println(n);
        }
}
 

Snippet 2:

def function(): Unit = {
    val numbers = List("one", "two", "three","four","five")
    numbers.foreach( println )
}
 

Snippet 3:

list <- c("one","two","three","four","five")
for(i in list){
  print(i)
}

 

Snippet 4:

(for-each (lambda (x) (display x) (newline))
          (list "one" "two" "three" "four" "five"))

 

Snippet 5:

list = ['one', 'two', 'three', 'four', 'five']
for item in list:
    print(item)

Try to determine which languages are used. Hint: I may have included a snippet from a language that isn't currently available in our DataSci Portal. Better yet, try to run some of this code. Keep in mind that the code snippets above are just that: snippets. In some cases you'll need to add a few lines of code to get actual output, which will be essentially the same for all the snippets (R's print, for example, adds an index prefix and quotation marks around each item).

So you might ask, why is the above exercise relevant for data scientists? As discussed here, a data scientist spends a lot of time pre-processing data to build usable data sets. Often this requires analyzing and parsing containers of information, or more formally, data structures, in an automated way. Data structures can contain almost any type of information. This is likely a challenging departure from the nicely formatted numbers that non-programmer analysts may be used to working with. Hopefully the newbie can start to see how you might begin to get a handle on "messy" data.
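To make that a little more concrete, here is a minimal Python sketch of looping over a container of messy records and cleaning each one. The records, field names, and helper functions (to_int, clean) are all made up for illustration; real data sets will need their own rules.

```python
# Hypothetical raw records: the field names and values are invented for this example.
raw_records = [
    {"name": " Alice ", "age": "34", "scores": ["88", "92", ""]},
    {"name": "bob", "age": "n/a", "scores": ["75"]},
    {"name": "", "age": "29", "scores": []},
]


def to_int(text):
    """Return text as an int, or None when it is not a whole number."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def clean(record):
    """Normalize one record: trim the name, parse numbers, drop blank scores."""
    scores = [to_int(s) for s in record["scores"]]
    return {
        "name": record["name"].strip().title(),
        "age": to_int(record["age"]),
        "scores": [s for s in scores if s is not None],
    }


# Loop over the container, clean each record, and keep only those with a name.
cleaned = [clean(r) for r in raw_records]
cleaned = [r for r in cleaned if r["name"]]

for row in cleaned:
    print(row)
```

The same looping idea from the snippets above does the heavy lifting here; the only new part is deciding what "clean" means for each field.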

An interesting example that I didn't include above is R's relatively new foreach construct, provided by the foreach package, which is similar to R's lapply function. Paired with a parallel backend, it allows for parallel execution across many cores on one computer or across multiple nodes in a distributed framework. This added functionality is useful when dealing with really large data sets or high-performance computing applications like simulation. Look here for more information.
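For readers who don't use R, the idea can be sketched in Python with the standard library's concurrent.futures module. This is only a rough analog of a parallel loop, not a port of the R foreach package, and the describe function is a made-up stand-in for real work.

```python
from concurrent.futures import ThreadPoolExecutor

words = ["one", "two", "three", "four", "five"]


def describe(word):
    """Stand-in for a slower per-item computation."""
    return f"{word} has {len(word)} letters"


# map() hands the items to a pool of workers and returns results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(describe, words))

for line in results:
    print(line)
```

Threads are a good fit when each item waits on a file or a network call. For CPU-heavy work, ProcessPoolExecutor from the same module offers the same map interface and can spread the items across cores.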

Generally speaking, for a novice programmer or data scientist, I would say there is no such thing as picking the "wrong" language. Programming languages can be thought of as economic agents, meaning they face many of the same forces that agents in a competitive marketplace face. A language has to offer tools that people want and need, or it faces obsolescence. It's not surprising, then, that the developers of Java introduced lambda expressions like the one above, given that a major competing language, Python, has long offered this functionality.
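For comparison with the Java lambda shown earlier, here is a small sketch of Python's lambda syntax. The variable names are my own and the list is just the one used throughout this post.

```python
numbers = ["one", "two", "three", "four", "five"]

# A lambda is an unnamed function written inline.
shout = lambda word: word.upper()

# Functions are ordinary values, so they can be passed to other functions.
lengths = list(map(lambda word: len(word), numbers))
by_length = sorted(numbers, key=lambda word: len(word))

print(shout("one"))
print(lengths)
print(by_length)
```

In everyday Python you would often write key=len or a list comprehension instead; the lambdas are spelled out here only to mirror the Java example.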

All the major all-purpose programming languages are going to have vast libraries that enable all the functionality that you might want and more. In addition, the all-purpose languages have vast communities from which to find varying levels of support. Domain-specific languages are also part of this competitive landscape and therefore, more and more, they incorporate functionality that blurs the lines between them and all-purpose programming languages. For clarification, C, Java, C++, Python, and Scala are examples of what most people would consider all-purpose languages. SQL, R, and Bash are examples of what I consider domain-specific languages. Again, distinctions like these are often a matter of opinion and perspective.

Compiled languages like C are generally faster than interpreted languages, but optimizations and, in some cases, alternative implementations have in many cases made the performance of interpreted languages like Python and hybrid languages like Java comparable to that of "fast" languages. This is another case where distinctions matter less and less over time, because many "interpreted" languages actually compile source code into bytecode before running it.
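You can see the bytecode step for yourself in the standard CPython interpreter, using the built-in dis module. This is a minimal sketch with a made-up function (total_length); the exact instructions printed differ between Python versions, so no output is shown.

```python
import dis


def total_length(words):
    """Add up the number of characters across all words."""
    return sum(len(word) for word in words)


# CPython has already compiled the function body to bytecode.
bytecode = total_length.__code__.co_code
print(type(bytecode).__name__, len(bytecode), "bytes of bytecode")

# Human-readable listing; instruction names vary between Python versions.
dis.dis(total_length)
```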

To be sure, some languages have steeper learning curves at the outset, but I argue that language choice and aptitude often come down to personal preference and situational requirements. Domain-specific languages are definitely an appropriate choice if you know exactly what you expect to do with the language or, in some cases, know that a company you expect to work for or with uses a specific language. So for an aspiring statistician, choosing R or SAS is a smart choice.

If you don't fall into that formally trained category, my recommendation is to just choose one of the major all-purpose programming languages and learn it well. We lean towards open-source frameworks, but be mindful that there can be direct or indirect costs with these kinds of choices. As for the language of data science, as you can see from the offerings in our DataSci Portal, no one language is recommended over another. I use whichever language or framework makes life easier for me on a given project. In short, don't fret over such choices unless there are compelling reasons to do so.