histogram bug

John_Hunter · December 28, 2006, 2:55pm

Adjusting zero and negative values (or maybe just zero)

    > would be unacceptable in a numerics library, but in the
    > context of our graphical transforms it is analogous to
    > clipping, and this we do all the time--we don't raise an
    > exception if someone tries to plot outside the box. (This
    > clipping strategy to handle nonpositive values is present
    > already in the LogLocator.)

I'm more comfortable dropping points than I am altering the data.
Consider some use case like

  ax.hist(...)
  rect = Rectangle(...)
  ax.add_patch(rect)
  ax.set_xscale('log')

In set_xscale we only see a bunch of rectangles. User defined
rectangles may be used to store data (eg the JPL uses custom bars with
xlimits that are the start and stop times when orbiting spacecraft are
within view of a ground station). Some of these users will also want
to "pick" the rectangle and inspect the data values. If we are
altering them, they have no guarantee that what they put in is what
they get out. Now we might argue that they get what they deserve if
they are using invalid data for a log scale, but one good solution in
my view is to fail noisily with helpful messages, and provide an easy
interface to do it right.

But if I am missing your point or you have an implementation that will
address these concerns, I'm certainly open to them. One possibility
is to flag artists that we create internally (eg hist) and take more
liberty with these, or have some flag like the Artist "clip_on"
property which allows the user to control whether their data is
mutable to support "helpful" alterations for log or other
transformations.

    > We can use such a small adjustment value that a problem such
    > as you mention above is highly unlikely--and note that
    > floating point itself has limitations, and does not permit
    > arbitrarily small or large numbers. Furthermore, note that

    > the user can always take advantage of the bottom kwarg. And
    > if in some extreme case the user has not used the bottom
    > kwarg and the bars really are shorter than the adjustment
    > value, it will probably be quite obvious.

The other thing to think about is choosing a bottom so that the
natural range of the tops is revealed. Eg if the bottom is 1e-200,
all the bars will appear the same height in many cases.

    > It is in ordinary line plotting that adjusting the value
    > could be misleading--it plots an extremely small number (if
    > the data limits are set to include it) instead of zero.
    > Maybe this is enough of a drawback to nix the whole idea.

I am happy with the current behavior of culling the non-positive
points. matlab does it and noone has complained here. There is a
difference in lines created with "plot" and bars created with hist: in
the former case the users explicitly picked the x,y points. In the
latter they implicitly do so with a default bottom kwarg that they may
have overlooked. This suggests to me that we need not treat the two
cases the same.

    > Every alternative that you propose is more complicated and
    > less comprehensive than the low-level adjustment, however,
    > and I see little if any real advantage to the alternatives.

If you would like to take a stab at an implementation I am happy to be
persuaded (with caveats below). In the simple case of

ax.hist()
ax.set_xscale('log')

this would indeed be fairly easy because you know how the data were
created. In the general case where the user has added lots-o-patches
to the axes, it may not be easy to do well. I'm still inclined to the
"explicit is better than implicit" and either require them to do one of

1) use the bottom kwarg

2) set log scaling before calling hist -- we can make the default
bottom=None and do different things for linear and log scaling

3) use a loghist function

    > If you still don't want the adjustment, then the easiest way
    > to improve the error message would be to raise a Python
    > exception instead of a c++ error in places like

    > for(int i=0; i < length; i++) { if (x<=0) { throw
    > std::domain_error("Cannot take log of nonpositive value"); }
    > else newx[i] = log10(x[i]);
    > }

> The domain error message above is informative, but it never
> makes it out to the user.

I really don't feel too strongly about this -- my gut reaction is that
a helpful message and an easy way to fix it is enough and it won't get
us into a possible quagmire of trying to be too smart. Personally, I
don't like it when computers try to be too helpful (think MS windows
and clippy); I like it when they do what I tell them to do. With the
snippet I posted previously we can easily warn them before doing the
transformation and with bottom=None we can handle the case when the
log scale is set before the call to hist. That in conjunction with
some additional docstrings in hist should work reasonably well.

That said, if you want to try something more ambitious I won't get in
your way. I recommend at a minimum that you have some artist flag
that governs whether mpl can make helpful data alterations (just as we
do with clip) so the power user can turn it off.

JDH

Eric_Firing1 · December 28, 2006, 9:39pm

John,

Thank you for your thorough and thoughtful reply. OK, I am convinced. I had not realized that the present line-drawing code actually is omitting nonpositive points, but now I see the Line.get_plottable() method.

I have committed changes to svn that I think will be helpful--maybe good enough for now. From the CHANGELOG:

2006-12-28 Improved error message for nonpositive input to log
            transform; added log kwarg to bar, barh, and hist,
            and modified bar method to behave sensibly by default
            when the ordinate has a log scale. (This only works
            if the log scale is set before or by the call to bar,
            hence the utility of the log kwarg.) - EF

Examples:

This makes a sensible plot that behaves well under zooming and panning:
hist(randn(1000), log=True)
show()

The following still generates an exception:
hist(randn(1000))
gca().set_yscale('log')
show()

but the traceback is more informative and ends with:

/usr/local/lib/python2.4/site-packages/matplotlib/patches.py in draw(self, renderer)
     183
     184 verts = self.get_verts()
--> 185 tverts = self.get_transform().seq_xy_tups(verts)
     186
     187 renderer.draw_polygon(gc, rgbFace, tverts)

ValueError: Cannot take log of nonpositive value

I have not put in any form of your suggested addition to set_yscale and set_xscale. Maybe I should, but I am hoping that the changes above are sufficient.

Eric

John Hunter wrote:

···

    > Adjusting zero and negative values (or maybe just zero)
    > would be unacceptable in a numerics library, but in the
    > context of our graphical transforms it is analogous to
    > clipping, and this we do all the time--we don't raise an
    > exception if someone tries to plot outside the box. (This
    > clipping strategy to handle nonpositive values is present
    > already in the LogLocator.)

I'm more comfortable dropping points than I am altering the data.
Consider some use case like

  ax.hist(...)
  rect = Rectangle(...)
  ax.add_patch(rect)
  ax.set_xscale('log')

In set_xscale we only see a bunch of rectangles. User defined
rectangles may be used to store data (eg the JPL uses custom bars with
xlimits that are the start and stop times when orbiting spacecraft are
within view of a ground station). Some of these users will also want
to "pick" the rectangle and inspect the data values. If we are
altering them, they have no guarantee that what they put in is what
they get out. Now we might argue that they get what they deserve if
they are using invalid data for a log scale, but one good solution in
my view is to fail noisily with helpful messages, and provide an easy
interface to do it right.

But if I am missing your point or you have an implementation that will
address these concerns, I'm certainly open to them. One possibility
is to flag artists that we create internally (eg hist) and take more
liberty with these, or have some flag like the Artist "clip_on"
property which allows the user to control whether their data is
mutable to support "helpful" alterations for log or other
transformations.

    > We can use such a small adjustment value that a problem such
    > as you mention above is highly unlikely--and note that
    > floating point itself has limitations, and does not permit
    > arbitrarily small or large numbers. Furthermore, note that

    > the user can always take advantage of the bottom kwarg. And
    > if in some extreme case the user has not used the bottom
    > kwarg and the bars really are shorter than the adjustment
    > value, it will probably be quite obvious.

The other thing to think about is choosing a bottom so that the
natural range of the tops is revealed. Eg if the bottom is 1e-200,
all the bars will appear the same height in many cases.

    > It is in ordinary line plotting that adjusting the value
    > could be misleading--it plots an extremely small number (if
    > the data limits are set to include it) instead of zero.
    > Maybe this is enough of a drawback to nix the whole idea.

I am happy with the current behavior of culling the non-positive
points. matlab does it and noone has complained here. There is a
difference in lines created with "plot" and bars created with hist: in
the former case the users explicitly picked the x,y points. In the
latter they implicitly do so with a default bottom kwarg that they may
have overlooked. This suggests to me that we need not treat the two
cases the same.

    > Every alternative that you propose is more complicated and
    > less comprehensive than the low-level adjustment, however,
    > and I see little if any real advantage to the alternatives.

If you would like to take a stab at an implementation I am happy to be
persuaded (with caveats below). In the simple case of

  ax.hist()
  ax.set_xscale('log')

this would indeed be fairly easy because you know how the data were
created. In the general case where the user has added lots-o-patches
to the axes, it may not be easy to do well. I'm still inclined to the
"explicit is better than implicit" and either require them to do one of

  1) use the bottom kwarg
   2) set log scaling before calling hist -- we can make the default
     bottom=None and do different things for linear and log scaling

  3) use a loghist function

    > If you still don't want the adjustment, then the easiest way
    > to improve the error message would be to raise a Python
    > exception instead of a c++ error in places like

    > for(int i=0; i < length; i++) { if (x<=0) { throw
    > std::domain_error("Cannot take log of nonpositive value"); }
    > else newx[i] = log10(x[i]);
    > }

    > The domain error message above is informative, but it never
    > makes it out to the user.

I really don't feel too strongly about this -- my gut reaction is that
a helpful message and an easy way to fix it is enough and it won't get
us into a possible quagmire of trying to be too smart. Personally, I
don't like it when computers try to be too helpful (think MS windows
and clippy); I like it when they do what I tell them to do. With the
snippet I posted previously we can easily warn them before doing the
transformation and with bottom=None we can handle the case when the
log scale is set before the call to hist. That in conjunction with
some additional docstrings in hist should work reasonably well.

That said, if you want to try something more ambitious I won't get in
your way. I recommend at a minimum that you have some artist flag
that governs whether mpl can make helpful data alterations (just as we
do with clip) so the power user can turn it off.

JDH

_Diwaker_Gupta · December 29, 2006, 2:25am

Examples:

This makes a sensible plot that behaves well under zooming and panning:
hist(randn(1000), log=True)
show()

Thanks! However...

The following still generates an exception:
hist(randn(1000))
gca().set_yscale('log')
show()

I think this makes the API more confusing. As an end user, I want the
API to be consistent and intuitive. Its weird if I can set log scale
after all plot commands, but not after calling hist. And yet, log
scale DOES work with that keyword arg. I might even think things were
broken had I not been following this thread.

Anyways, thanks for looking into this!

···

--
Web/Blog/Gallery: http://floatingsun.net/blog

Eric_Firing1 · December 29, 2006, 5:40am

Diwaker Gupta wrote:

Examples:

This makes a sensible plot that behaves well under zooming and panning:
hist(randn(1000), log=True)
show()

Thanks! However...

The following still generates an exception:
hist(randn(1000))
gca().set_yscale('log')
show()

I think this makes the API more confusing. As an end user, I want the
API to be consistent and intuitive. Its weird if I can set log scale
after all plot commands, but not after calling hist. And yet, log
scale DOES work with that keyword arg. I might even think things were
broken had I not been following this thread.

Anyways, thanks for looking into this!

I agree that it is not optimal, but I have not figured out any simple way of making bar (which is the underlying plot command) do the right thing without knowing beforehand whether it is dealing with a log or a linear axis. It simply has to make different choices for the patch (rectangle) corners and the data limits that are used for autoscaling. For this to work correctly in an interactive mode with set_yscale('log') called after the hist or bar command would require a mechanism for undoing and then redoing the hist or bar command. Such a mechanism could be developed, but it would require some structural changes in mpl (or maybe some very ugly hacks), and I don't see that this particular problem is bad enough to motivate such changes.

Actually, I think the sticking point is the autoscaling, not the patches. For the default "bottom=None" case, we could set bottom to 1e-100 for both the log and linear axes, and for all practical purposes it would work correctly for either, indistinguishably from the way it does now. The problem is that with a linear axis we want the axis to start at zero by default, but with a log axis we want it to start a bit below the lowest bar, so the autoscaling is inherently different and needs to be based on a different dataLim bounding box.

Maybe in time we will figure out better solutions.

Eric