The Foundations of Statistics: October 2010

One of the first problems I encounter when teaching from this book is fear. So I'd like to inaugurate this blog with a comment which may serve as an antidote.

Here is an excerpt from a very useful email that a former colleague of mine wrote to a beginning programmer. The advice here will be useful to people who want to use our book but are afraid to because they have never programmed before, or are afraid of the idea of doing "computer programming."

  How to become self-sufficient with everyday computing issues

   The author is a colleague who shall remain anonymous
   (Note: I made up the title)

In my experience, both first and second hand, solving practical
programming issues requires some combination of one or more of the
following:

1. "can do" attitude;
2. tenacity;
3. experience from reading existing programs;
4. experience from programming;
5. experience from reasoning about programs;
6. domain knowledge about the programming task (protocols, specifications).

None of these can really be taught, except for the last one. You might
gain some experience by taking a course and doing the assignments.

It's a vicious circle: you need experience to write programs, but
you'll only get it by writing programs. At some point you'll just
have to make a decision that you can do it, and then get started by
writing programs. There are many good books that you can read,
depending on what exactly you want to learn. You can't go terribly
wrong by reading anything published by O'Reilly (even the comic
books). Avoid the "For Dummies" books unless they are written by John
Levine, who has written at least one O'Reilly book. There is also
very good on-line documentation available. For example, the user's
guide for GNU Awk is perhaps the best book on Awk ever. If you want
to learn C, do not under any circumstances read Kernighan & Ritchie;
for C++, do not read Stroustrup.

Do not expect any quick results. Programming is a craft that you have
to practice. Depending on the language and task, it might take weeks,
months, or years until you can write programs fluently without
constantly pausing to consult documentation. And that's just for the
syntax, semantics, idioms, and libraries for one language. So start
with a small language. For text mangling, Awk is my first choice.
It's a very small language and you'll outgrow it sooner or later.
Scheme is a nice first general-purpose language, since it's small but
can grow. Java, C++, Perl, Python are much bigger and come with
extensive libraries, some of which you'd have to first know how to use
in order to get anything done; avoid them for now, but start looking
into Python when Awk is starting to get a bit tight under the arms.
Avoid Perl; it's Pure Evil (or at least wait for version 6), as is the
C-shell family. Avoid C++ whenever and wherever you can.

Before you get started with anything, you have to pick your weapons
carefully: what are your short-term and long-term goals? "Learning
how to program" is a very vague concept that ranges from
point-and-click assembly of components (Visual Basic) to spreadsheets
to shell scripts to writing C programs, and from hacking together a
script that Does The Job to designing and documenting libraries and
components.

I'm assuming that you'll be writing mostly short programs that will be
modified often and run only a few times. In that case you want
simplicity. You want languages that hide the low-level details of the
hardware from you as much as possible. So stick with sh, ksh, Awk,
Python, Scheme, Caml, and Java. If you don't know Awk very well already,
start with that. Find the documentation on Gawk (GNU Awk), called
"Effective Awk Programming".

That's part of another exercise: there is a *lot* of information
available on the net, and plenty on your average Unix system,
including ours. Find out how to use it. You have to be
self-sufficient in that respect as much as possible, since you won't
have the time to track down somebody who can help you, nor can you
afford to wait for email responses. Start with the documentation
available via Emacs, the so-called "info" pages, in addition to the
"man" pages.

You have to be able to go from there on your own. Trial and error is
another favorite strategy. I have a few "sandbox" directories whose sole
purpose is to be a safe spot where no important data reside and where I
can try out and test things. Set aside plenty of time for this. This is
not a nuisance, but an Important Step along The Way. Be prepared to set
aside 4 hours or more for writing a 10-line script in the beginning.
You'll be able to do it in 10 minutes later, but only if you know what
you're doing. Play with it. Make sure you understand all its parts.
Write documentation, in case you want to use it again and to make sure you
know what everything is doing. Documentation is *part* of the program; it
does not belong in lab notebooks etc. State assumptions clearly, e.g.,
"expects input lines of the form <foo>:<bar>$" and write down what the
program does.
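
To make the last piece of advice concrete, here is a minimal sketch of what
"documentation as part of the program" might look like. It is written in
Python purely for illustration (the email itself recommends starting with
Awk). The function name count_keys, the blank-line rule, and the error
handling are all assumptions made up for this example; only the
<foo>:<bar> input format comes from the email.

```python
"""Count how often each key occurs in colon-separated input.

Assumptions (hypothetical, chosen for this illustration):
  - expects input lines of the form <foo>:<bar>
  - <foo> is non-empty and contains no colon; <bar> may contain colons
  - blank lines are skipped; any other malformed line raises ValueError

What the program does: maps each <foo> to the number of lines it occurs on.
"""
from collections import Counter


def count_keys(lines):
    """Return a Counter of the <foo> parts of '<foo>:<bar>' lines."""
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines are skipped, as stated in the docstring
        # Split at the first colon only, so <bar> may itself contain colons.
        key, sep, _value = line.partition(":")
        if not sep or not key:
            raise ValueError(f"line {number}: expected <foo>:<bar>, got {line!r}")
        counts[key] += 1
    return counts
```

In a sandbox directory you could try it with something like
count_keys(open("test.txt")), where test.txt is a throwaway file you made
up for the experiment. The point is less the code than the docstring: the
assumptions and the description travel with the script instead of living
in a notebook.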

Monday, October 4, 2010

On fearlessness