Best practices for organizing plotting code? (Greg Novak)

Hi,

This is about Greg's post on how to organize plotting code. The problem comes up regardless of who collected the data or how it was generated. I'm an experimental neuroscientist: I collect the data that I then analyze to test hypotheses and models. After some thrashing about I've settled on the following design (parts in common with Greg's):

1. Analysis and plotting are separate. Analysis often takes a lot of CPU time, whereas plotting doesn't. A given analysis can be plotted in many different ways, and I often want to tweak plots without recomputing the data each time. A pragmatic approach is to save the analysed data as a pickle file and have the plotting code load it.

2. Analysis code is written to run non-interactively, using a command-line options package to pass parameters and instructions. This is useful when I want to run the code on remote machines or parallelize it.

3. No GUIs. This has saved me so much time. I just write plotting code that pops up (or saves as PDF) one figure according to command-line options. If I need a new type of figure I just copy the code into a new script/module and save it separately. This is much easier to debug than interactive GUIs that do a gazillion things.

4. Source control. Don't delete any code; save it under different folders organized by idea or by date. I've always found myself asking, months later, "I made a plot like this, where is it? I want to see what I did there."
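
As a rough sketch of points 1 to 3, and not the code actually used here: the analysis step writes a pickle, and a separate plotting step loads it and produces one figure. The names `analyse`, `save_result`, `load_result`, `plot_result`, the `result.pkl` default and the sine-wave stand-in are all made up for illustration, and `argparse` is assumed as the command-line options package.

```python
import argparse
import math
import pickle


def analyse(n_samples, freq):
    """Stand-in for an expensive analysis; returns plain Python data."""
    t = [i / n_samples for i in range(n_samples)]
    y = [math.sin(2 * math.pi * freq * ti) for ti in t]
    return {"t": t, "y": y, "params": {"n_samples": n_samples, "freq": freq}}


def save_result(result, path):
    with open(path, "wb") as fh:
        pickle.dump(result, fh)


def load_result(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


def plot_result(result, out_path=None):
    """Draw one figure from a saved result: save it if out_path is given, else pop it up."""
    import matplotlib  # imported here so machines that only analyse do not need it
    if out_path:
        matplotlib.use("Agg")  # file output only, no display required
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(result["t"], result["y"])
    ax.set_xlabel("t")
    ax.set_ylabel("y")
    if out_path:
        fig.savefig(out_path)
    else:
        plt.show()


def main(argv=None):
    parser = argparse.ArgumentParser(description="analyse once, plot many times")
    sub = parser.add_subparsers(dest="command", required=True)
    p_an = sub.add_parser("analyse")
    p_an.add_argument("--n-samples", type=int, default=200)
    p_an.add_argument("--freq", type=float, default=3.0)
    p_an.add_argument("--out", default="result.pkl")
    p_pl = sub.add_parser("plot")
    p_pl.add_argument("--data", default="result.pkl")
    p_pl.add_argument("--pdf", default=None)
    args = parser.parse_args(argv)
    if args.command == "analyse":
        save_result(analyse(args.n_samples, args.freq), args.out)
    else:
        plot_result(load_result(args.data), args.pdf)

# When saved as a script, finish with:  if __name__ == "__main__": main()
```

With that guard added, something like `python script.py analyse --freq 5` followed by `python script.py plot --pdf fig.pdf` would recompute once and then replot as often as needed (the script name is hypothetical). Pickle files should only be loaded from sources you trust.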

That's the current credo that has helped me waste a little less time when I want to test an idea with my data.

Best
-Kaushik


------------------------------

Message: 7
Date: Mon, 21 Dec 2009 17:42:40 -0500
From: Greg Novak <novak@...354...>
Subject: [Matplotlib-users] Best practices for organizing plotting
        code?
To: matplotlib-users@lists.sourceforge.net
Message-ID:
        <ad0d4fcf0912211442x1261b84ar79945c045a1afa60@...288...>
Content-Type: text/plain; charset=ISO-8859-1

Hello,
I do computational science and I think I'm typical in that I've
accumulated a huge pile of code to post-process simulations and draw
plots. I think the number of lines of plotting code is now greater
than the number of lines in the actual simulation code... The problem
with plotting code is that so much of it has such a short
lifetime---you have an idea, spend some time writing code to draw the
relevant plot, then the plot isn't interesting and you delete the
code. Therefore there's little incentive to spend any time making
sure that plotting code is at all well-designed. Nevertheless, _some_
of it tends to live a long time and get ever more complicated---then
the lack of design becomes ever more painful as time goes on. You
simply don't know at the beginning which code will be thrown away and
which will live a long time.

Over the years I've developed my favorite way to organize my plotting
code but it's far from perfect and I'd love to gather ideas from the
MPL community. So, my current "design principles" are basically
these:

1) Don't over-design. A simple system that's used consistently is
better than a half-implemented complicated system. Furthermore, most
plotting code gets thrown away, so keeping overhead down is one of the
primary considerations.

2) Keep computation separate from plotting wherever possible.
Therefore I have functions like "def compute_optical_depth(...)" that
compute the physical quantities to be plotted and "def
plot_optical_depth(...)" that handle everything about the visual
appearance of the plot. Then when I want to draw some other plot
involving optical depth, the calculation is neatly packaged into a
function.

3) Keep annotation, axis labels, legends, etc, separate from the code
that actually draws the lines on the axes. This allows you to compose
plots to a certain extent. I often find myself saying "I want plot B
to look just like plot A but with this extra information, extra lines,
extra annotation, or whatever." If the function that draws plot A just
puts the data on the axes without axis labels, etc, then the function
that draws plot B can easily use it directly. If the function that
draws plot A _also_ draws a bunch of annotations and labels, then the
function that draws plot B must either get rid of them or hope they
still make sense in the new context.

4) Don't put clf() and cla() all over the place. When working
interactively, it's very tempting to put clf()'s into every function
that draws a plot in order to save a few keystrokes. However, plots
don't know the context into which they're being drawn, therefore they
have no authority to clear the screen. They may "own" the whole
plotting window, or they may be incorporated into a larger context.
The function that worries about axis labels, annotations, and titles
is allowed to call cla(). The function that worries about subplots is
allowed to call clf(). If you might use the code over a slow link
(e.g. connecting to a supercomputing site via residential DSL) then no
function should call draw() -- that's the user's job.

The upshot of these is that I end up with four layers of functions:

1) compute_physical_quantity(...): just handles numbers
2) draw_physical_quantity(...): has calls to pylab.plot() handling
colors, linestyles, etc, but not annotations
3) some_plot(...): has calls to draw_physical_quantity(),
some_related_physical_quantity(), along with axis labels, annotations,
legends, and pylab.cla()
4) some_figure(...): has multiple panels with calls to
pylab.subplot(), pylab.clf(), some_plot_a(), some_plot_b(), etc.

Sometimes I'm lazy and combine layers 2 and 3, if layer 2
would really be just a single call to pylab.plot().
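
A minimal sketch of the four layers, under several assumptions: `matplotlib.pyplot` stands in for `pylab`, the optical-depth formula is a made-up placeholder, and the function names other than `compute_optical_depth` are hypothetical. Note that nothing here calls draw().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np


# Layer 1: numbers only, no plotting.
def compute_optical_depth(radius, scale=1.0):
    # Placeholder formula, used only to have something to plot.
    return np.exp(-radius / scale)


# Layer 2: puts data on the current axes; no labels, no cla().
def draw_optical_depth(radius, scale=1.0, **line_kwargs):
    return plt.plot(radius, compute_optical_depth(radius, scale), **line_kwargs)


# Layer 3: owns one axes, so it may clear it and annotate it.
def optical_depth_plot(radius):
    plt.cla()
    draw_optical_depth(radius, scale=1.0, color="k", label="scale = 1")
    draw_optical_depth(radius, scale=2.0, color="r", linestyle="--", label="scale = 2")
    plt.xlabel("radius")
    plt.ylabel("optical depth")
    plt.legend()


# Layer 4: owns the figure, so it may call clf() and lay out the panels.
def optical_depth_figure(radius):
    plt.clf()
    plt.subplot(1, 2, 1)
    optical_depth_plot(radius)
    plt.subplot(1, 2, 2)
    optical_depth_plot(radius)
    plt.yscale("log")  # same plot, different presentation in the second panel
    return plt.gcf()
```

Because layer 2 never labels or clears anything, a different layer-3 function can reuse `draw_optical_depth` next to other curves without having to undo annotations.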

Please remember that I'm not writing these down because I think
they're so great that everyone needs to know about them. I'm hoping
that people will respond with much better ideas that I can adopt for
myself.

Thanks,
Greg

A couple more thoughts on this:

4) Don't put clf() and cla() all over the place.

Absolutely.

My addition to this is to use the OO API more than the pylab one. Put all your plotting code into functions that take an axes object as a parameter, then go from there. That way you have separated the generation of figures (collections of axes) from the plotting itself.

You can do the same at a higher level too: put the code that creates the figures in a function that takes a figure as an argument. Then you can use the same code to generate PDFs, embed in a GUI, etc.
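
One way this could look, as a sketch only: the function names and the exponential-decay data are invented for illustration, and the Agg canvas is just one assumed way to get a figure without going through pylab.

```python
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure


def draw_decay(ax, x, rate=1.0, **line_kwargs):
    """Plot onto the axes that was passed in; no global state is touched."""
    return ax.plot(x, np.exp(-rate * x), **line_kwargs)


def decay_panel(ax, x):
    """Lines plus labels and legend for a single axes."""
    draw_decay(ax, x, rate=1.0, label="rate = 1")
    draw_decay(ax, x, rate=2.0, linestyle="--", label="rate = 2")
    ax.set_xlabel("x")
    ax.set_ylabel("exp(-rate * x)")
    ax.legend()


def decay_figure(fig, x):
    """Fill a caller-supplied Figure; the caller decides where it ends up."""
    ax_lin, ax_log = fig.subplots(1, 2)
    decay_panel(ax_lin, x)
    decay_panel(ax_log, x)
    ax_log.set_yscale("log")
    return fig


def build_figure():
    fig = Figure(figsize=(8, 3))
    FigureCanvasAgg(fig)  # attach a canvas so the figure can be saved without pylab
    return decay_figure(fig, np.linspace(0.0, 5.0, 100))
```

Calling `build_figure().savefig("decay.pdf")` would write a PDF (the file name is arbitrary), while passing a figure owned by a GUI canvas to the same `decay_figure` would reuse the plotting code unchanged.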

HTH,

-Chris


--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@...259...

Hi,

Regarding Greg's post about how to organize plotting code for data. This is a common issue encountered regardless of who collected the data or how the data was generated. I'm an experimental neuroscientist in that I collect the data that I then analyze to test hypotheses and models. After some thrashing about I've kind of settled with the following design (parts in common with Greg)

  1. Analysis and plotting are separate. Analysis often takes a lot of CPU time, whereas plotting doesn't. A given analysis can be plotted in many different ways and often I want to tweak plots. I don't want to recompute the data each time. So a pragmatic way is to save the analysed data as a pickle file and have the plotting code load it.

  2. Analysis code is written to be run non-interactively using the command line options package to pass parameters/instructions. Useful when I want to run the code on remote machines, or parallelize the code.

I don't do very heavy computations that require multiple cores; most of the time a single fast computer is enough for my analysis and plotting needs. That said, I want to comment on the last two points of your e-mail.

  3. No GUIs. This has saved me so much time. I just write plotting code that pops up (or saves as pdf) one figure according to command line options. If I need a new type of figure I just copy the code into a new script/module and save it separately. This is much easier to debug than interactive GUIs that do a gazillion things.

Sometimes GUIs simplify things a lot, especially when I am taking a quick look at the data. Have a look at Traits [http://code.enthought.com/projects/traits/]. Your opinion might change once you see how easy it is to design a GUI for your needs.

  4. Source control. Don't delete any code, save it under different folders organized by idea or by date. I've always found myself asking, months later, I made a plot like this, where is it, I want to see what I did there.

There is an even better approach for this: web-based source-code management systems (e.g. code.google.com or www.sourceforge.net). Either one provides a great amount of flexibility for solo or multi-developer projects.


On Tue, Dec 22, 2009 at 7:17 AM, Ghose, Kaushik <Kaushik_Ghose@…2126…> wrote:

That’s the current credo that has helped me waste a little less time when I want to test an idea with my data.

Best

-Kaushik



Gökhan