Monday, March 12, 2007

You say tomato, I say $tomato

When I first started programming thirty or so years ago, my choice of language was simple: machine code or bust. I didn't like machine code much, so I never wrote very much. A pattern, however, was established which would be maintained for a long time. A limited set of computer languages would be available at any one time, and I would pick the one that I liked the best and work solely in that. Machine code gave way to assembler (never mastered) to BASIC to APL to Pascal. Transitions were short and sweet; once a better language was available to me, I switched completely. A few languages (Logo, Forth, Modula 2) were contemplated, but never had the necessary immediate availability to be adopted.
A summer internship tweaked the formula slightly -- at work I would use RS/1, because that's what the system was, but at home I stuck to Pascal. For four years of college this was the pattern.

Grad school was supposed to mean one more shift: to C++. However, I soon discovered the universe of useful UNIX utility languages, and sed and awk and shell scripts started popping up. Eventually I discovered make, which is a very different language. A proprietary GUI language based on C++ came in handy. Prolog didn't quite get a proper trial, but at least I read the book. Finally, I found Perl and tried to focus on that, but the mold had been broken -- and for good measure I wrote one of the world's first interactive genome viewers in Java. My thesis work consisted of an awful mess of all of these.

Come Millennium, I swore I would write nothing but Perl. But soon, that had to be modified as I needed to read and write relational databases, which requires SQL. Ultimately, I wanted to do statistics -- and these days that means R.

There are a number of computer language taxonomies which can be employed. For example, with the exceptions of make, SQL (as I used it) and Prolog, all of these languages are procedural -- you write a series of steps and they are executed. The other three fit more of a pattern in which the programmer specifies assertions, conditions or constraints and the language interpreter or compiler executes commands or returns data according to those specifications.

Within the procedural languages, there is a lot of variation. Some of this represents shared history. For example, C++ is largely an extension of C, so it shares many syntactic features. Perl also borrowed heavily from C, so much is similar. R is also loosely in the C syntax family. All of these languages tend to be terse and heavily use non-alphabetic characters. On the other hand, SQL is intrinsically loquacious.

The fun part is when you are trying to use multiple languages simultaneously, as you must keep straight the differences & properly shift gears. Currently, I'm working semi-daily in Perl, SQL and R, and there is plenty to catch me up if I'm napping. For example, many Perl and R statements can interchange single and double quotes freely -- as long as you do so symmetrically; SQL needs single quotes around strings.
Perl & R use the C-style != for inequality; SQL uses the older style <>. In parallel, Perl & R use == for equality whereas SQL uses a single = -- and since a single = in Perl is assignment, forgetting this rule can lead to interesting errors! R is a little easier to keep straight, as assignment is <- . R and Perl also diverge on $ -- for Perl it precedes every single value (scalar) variable, whereas in R it specifies a column of a table. I haven't done C++ or Java for over ten years, but my mind still wants to parse an R variable foo.bar as bar being a member of class instance foo (perhaps because that's the SQL idiom as well), but in R the period is just another legal character for composing a name -- and in Perl it's yet another syntax ( ->{'key'} ) to access the members of a class.

While I know all the rules, inevitably there is a mistake a day (or worse an hour!) where my R variables start growing $ and I try to select something out of my SQL using != . Eventually my mind melts down and all I can write is:
select tzu->{'name'},shih$color from $shih,$tzu where shih.dog==tzu.dog

which doesn't work in any language!
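
For comparison, here is a rough sketch of what that line was presumably reaching for in plain SQL: dotted table.column names, no $, and a single = in the join. It is run through Python's sqlite3 module purely for convenience. The table contents, the dog names and the extra <> filter are made up for illustration, and the join itself is only a guess at the intent.

```python
import sqlite3


def run_demo():
    """Build two tiny hypothetical tables and run the cleaned-up query."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical schemas: both tables share a 'dog' key.
    cur.execute("CREATE TABLE shih (dog INTEGER, color TEXT)")
    cur.execute("CREATE TABLE tzu (dog INTEGER, name TEXT)")
    cur.executemany("INSERT INTO shih VALUES (?, ?)",
                    [(1, "gold"), (2, "white")])
    cur.executemany("INSERT INTO tzu VALUES (?, ?)",
                    [(1, "Fido"), (2, "Rex")])

    # SQL idioms: table.column, single = for equality, <> for inequality,
    # and single quotes around the string literal.
    rows = cur.execute(
        "SELECT tzu.name, shih.color FROM shih, tzu "
        "WHERE shih.dog = tzu.dog AND shih.color <> 'white'"
    ).fetchall()
    conn.close()
    return rows


print(run_demo())
```

One caveat: SQLite happens to be forgiving and also accepts == and != , so it will not catch every slip the way a stricter database might.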

4 comments:

Michael Barton said...

A great post Keith.

I think the difference between computational science and computer science is that we use computers to do science rather than studying computers as a science.

I write computer code to get something done. I code something up to get a result; the actual script is then disposable. I don’t go through a design process and build an application. I think this is reflected in the use of fast scripting languages such as R and Perl in bioinformatics, as opposed to application building languages such as C++ and Java.

Anonymous said...

I agree with Mike, as a bioinformatician the intention in a computational science is to write the computer code to get things done.

But as a computer scientist I always remember the cliche, "don't reinvent the wheel," which applies more to the biologists. I believe as someone who is knowledgeable in programming that it is my duty to take the extra time to build useful and well commented "wheels" to help the less programming savvy out there. This way you don't have biologists halting their science trying to learn to program.

As far as the original post, I think his description of switching from language to language over the years is the take home message for biologists out there learning programming. I can sit here all day and post why you should do everything in Perl or C, but in the end it's up to the comfort and skill level of the programmer. But as the OP alluded to, once you know one programming language adopting another is relatively straightforward; it's just the little syntax quirks that you have to get used to.

Amit said...

On any given day I will use perl, C, sql, bash, and awk. Earlier this year when I was doing a lot of C, and then had to write some perl, I actually typed make to run my perl script!

Anonymous said...

It's the single & and | in R versus pretty much every other language that always catches me.