Regarding Greg's post about how to organize plotting code for data: this is a common issue regardless of who collected the data or how it was generated. I'm an experimental neuroscientist, in that I collect the data that I then analyze to test hypotheses and models. After some thrashing about, I've more or less settled on the following design (parts in common with Greg):
1. Analysis and plotting are separate. Analysis often takes a lot of CPU time, whereas plotting doesn't. A given analysis can be plotted in many different ways, and I often want to tweak plots without recomputing the data each time. A pragmatic approach is to save the analysed data as a pickle file and have the plotting code load it.
2. Analysis code is written to be run non-interactively, using a command-line options package to pass parameters/instructions. This is useful when I want to run the code on remote machines or parallelize it.
3. No GUIs. This has saved me so much time. I just write plotting code that pops up (or saves as pdf) one figure according to command line options. If I need a new type of figure I just copy the code into a new script/module and save it separately. This is much easier to debug than interactive GUIs that do a gazillion things.
4. Source control. Don't delete any code; save it under different folders organized by idea or by date. I've always found myself asking, months later: "I made a plot like this, where is it? I want to see what I did there."
That's the current credo that has helped me waste a little less time when I want to test an idea with my data.
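A minimal sketch of points 1 to 3, under stated assumptions: the analysis here is a toy stand-in, the option names (--n-trials, --seed, --out) and every function name are hypothetical, and argparse is used only as one example of a command-line options package. The analysis writes a pickle; the plotting function loads it and either pops up a figure or saves a PDF.

```python
import argparse
import pickle

import numpy as np
import matplotlib.pyplot as plt


def run_analysis(n_trials, seed):
    """Stand-in for an expensive analysis; returns a dict of results."""
    rng = np.random.default_rng(seed)
    trials = rng.normal(size=(n_trials, 100))
    return {"mean_trace": trials.mean(axis=0), "n_trials": n_trials}


def save_results(results, path):
    """Store the analysed data so plots can be redone without recomputing."""
    with open(path, "wb") as f:
        pickle.dump(results, f)


def load_results(path):
    """Load previously analysed data for plotting."""
    with open(path, "rb") as f:
        return pickle.load(f)


def plot_mean_trace(results, pdf_path=None):
    """Draw one figure; show it on screen, or save it if pdf_path is given."""
    fig, ax = plt.subplots()
    ax.plot(results["mean_trace"])
    ax.set_xlabel("sample")
    ax.set_ylabel("mean response")
    if pdf_path is None:
        plt.show()
    else:
        fig.savefig(pdf_path)
        plt.close(fig)


def main(argv=None):
    """Non-interactive entry point: parse options, run, save the pickle."""
    parser = argparse.ArgumentParser(description="Run analysis, save pickle")
    parser.add_argument("--n-trials", type=int, default=50)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--out", default="analysis.pkl")
    args = parser.parse_args(argv)
    save_results(run_analysis(args.n_trials, args.seed), args.out)
    return args.out
```

In a real script, main() would be called under an `if __name__ == "__main__":` guard, so that something like `python analysis.py --n-trials 200 --out run1.pkl` (file names hypothetical) can run on a remote machine. A separate plotting script would then call plot_mean_trace(load_results("run1.pkl"), "run1.pdf").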
Date: Mon, 21 Dec 2009 17:42:40 -0500
From: Greg Novak <novak@...354...>
Subject: [Matplotlib-users] Best practices for organizing plotting
I do computational science and I think I'm typical in that I've
accumulated a huge pile of code to post-process simulations and draw
plots. I think the number of lines of plotting code is now greater
than the number of lines in the actual simulation code... The problem
with plotting code is that so much of it has such a short
lifetime---you have an idea, spend some time writing code to draw the
relevant plot, then the plot isn't interesting and you delete the
code. Therefore there's little incentive to spend any time making
sure that plotting code is at all well-designed. Nevertheless, _some_
of it tends to live a long time and get ever more complicated---then
the lack of design becomes ever more painful as time goes on. You
simply don't know at the beginning which code will be thrown away and
which will live a long time.
Over the years I've developed my favorite way to organize my plotting
code but it's far from perfect and I'd love to gather ideas from the
MPL community. So, my current "design principles" are basically:
1) Don't over-design. A simple system that's used consistently is
better than a half-implemented complicated system. Furthermore, most
plotting code gets thrown away, so keeping overhead down is one of the
2) Keep computation separate from plotting wherever possible.
Therefore I have functions like "def compute_optical_depth(...)" that
compute the physical quantities to be plotted and "def
plot_optical_depth(...)" that handle everything about the visual
appearance of the plot. Then when I want to draw some other plot
involving optical depth, the calculation is neatly packaged into a
3) Keep annotation, axis labels, legends, etc, separate from the code
that actually draws the lines on the axes. This allows you to compose
plots to a certain extent. I often find myself saying "I want plot B
to look just like plot A but with this extra information, extra lines,
extra annotation, or whatever." If the function that draws plot A just
puts the data on the axes without axis labels, etc, then the function
that draws plot B can easily use it directly. If the function that
draws plot A _also_ draws a bunch of annotations and labels, then the
function that draws plot B must either get rid of them or hope they
still make sense in the new context.
4) Don't put clf() and cla() all over the place. When working
interactively, it's very tempting to put clf()'s into every function
that draws a plot in order to save a few keystrokes. However, plots
don't know the context into which they're being drawn, therefore they
have no authority to clear the screen. They may "own" the whole
plotting window, or they may be incorporated into a larger context.
The function that worries about axis labels, annotations, and titles
is allowed to call cla(). The function that worries about subplots is
allowed to call clf(). If you might use the code over a slow link
(e.g. connecting to a supercomputing site via residential DSL) then no
function should call draw() -- that's the user's job.
The upshot of these is that I end up with four layers of functions:
1) compute_physical_quantity(...): just handles numbers
2) draw_physical_quantity(...): has calls to pylab.plot() handling
colors, linestyles, etc, but not annotations
3) some_plot(...): has calls to draw_physical_quantity(),
some_related_physical_quantity(), along with axis labels, annotations,
legends, and pylab.cla()
4) some_figure(...): has multiple panels with calls to
pylab.subplot(), pylab.clf(), some_plot_a(), some_plot_b(), etc.
Sometimes, out of laziness, I combine layers 2 and 3 when layer 2
would really be just a single call to pylab.plot().
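The four layers might look roughly like the sketch below. It is an illustration only, not the actual code: the exponential "optical depth" is a toy formula, every function name other than compute_optical_depth is made up, and matplotlib.pyplot calls stand in for the pylab calls named above. Only the axes-level functions call cla(), only the figure-level function calls clf(), and nothing calls draw().

```python
import numpy as np
import matplotlib.pyplot as plt


# Layer 1: numbers only, no plotting.
def compute_optical_depth(radius, scale=1.0):
    """Toy stand-in (exponential profile), not a real physical model."""
    return np.exp(-radius / scale)


# Layer 2: put data on the current axes; no labels, no cla(), no draw().
def draw_optical_depth(radius, scale=1.0, **line_kwargs):
    tau = compute_optical_depth(radius, scale)
    return plt.plot(radius, tau, **line_kwargs)


def label_optical_depth_axes():
    """Annotation shared by the layer-3 functions below."""
    plt.xlabel("radius")
    plt.ylabel("optical depth")
    plt.legend()


# Layer 3: owns the axes, so it may clear them and add labels and legends.
def optical_depth_plot(radius):
    plt.cla()
    draw_optical_depth(radius, scale=1.0, color="k", label="scale = 1")
    label_optical_depth_axes()


def optical_depth_comparison_plot(radius):
    """'Plot B': plot A's data plus one extra line, reusing layer 2."""
    plt.cla()
    draw_optical_depth(radius, scale=1.0, color="k", label="scale = 1")
    draw_optical_depth(radius, scale=2.0, color="r", linestyle="--",
                       label="scale = 2")
    label_optical_depth_axes()


# Layer 4: owns the figure, so it may clear it and lay out the panels.
def optical_depth_figure(radius):
    plt.clf()
    plt.subplot(1, 2, 1)
    optical_depth_plot(radius)
    plt.subplot(1, 2, 2)
    optical_depth_comparison_plot(radius)
    return plt.gcf()
```

Because draw_optical_depth() never clears or labels anything, the comparison plot can call it twice on the same axes, and the decision about when to show or redraw the figure stays with the caller.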
Please remember that I'm not writing these down because I think
they're so great that everyone needs to know about them. I'm hoping
that people will respond with much better ideas that I can adopt for