debugging python code

Jochen_Voss4 · December 1, 2004, 6:39pm

Hello Perry,

why not "x = sin(t)"? That should work. No need to use map or math.sin

True, this works as such, but the crash is still there.

Does it plot interactively? Do you get any error message or does it
just crash out of python?

No, it does not plot interactively either.
If I run the script

    from matplotlib.matlab import *
    t=arange(0,10,0.1)
    x=sin(t)
    plot(t,x)
    print "fisch"
    show()

I get the following output:

    voss@...8... [~/src/mpl/test] ./test.py --numarray
    fisch
    Floating point exception

Thank you very much,
Jochen

On Wed, Dec 01, 2004 at 01:28:57PM -0500, Perry Greenfield wrote:
--

Andrew_Straw5 · December 2, 2004, 2:38am

Jochen Voss wrote:

I get the following output:

    voss@...8... [~/src/mpl/test] ./test.py --numarray
    fisch
    Floating point exception

Thank you very much,
Jochen

It sounds like you may have hit a nasty glibc bug that caused me much head scratching over the months. Check this thread:

Bottom line:

numarray does The Right Thing and attempts to set up floating point exception handling, but older versions of glibc (such as that in Debian sarge) have a bug whereby the floating point error bits in the SSE are not properly cleared, leading to a SIGFPE terminating the program the next time the SSE unit is used.

One solution:

Rebuild glibc with the appropriate patch.

Jochen_Voss4 · December 5, 2004, 5:04pm

Hello everybody,

It sounds like you may have hit a nasty glibc bug that caused me much
head scratching over the months. Check this thread:

ActiveState Community - Boosting coder and team productivity with ready-to-use open source languages and tools.

Bottom line:

numarray does The Right Thing and attempts to set up floating point
exception handling, but older versions of glibc (such as that in Debian
sarge) have a bug whereby the floating point error bits in the SSE are
not properly cleared, leading to a SIGFPE terminating the program the
next time the SSE unit is used.

Yes, this was the problem. I applied the patch from

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=279294

and the problem disappeared. Does this imply that on current
Debian/unstable systems matplotlib can only be used with
python-numeric and not with python-numarray?

By the way: I am still interested in my original question. How would
I use a debugger etc. to find the problem myself in such a situation?

Many thanks,
Jochen

On Wed, Dec 01, 2004 at 06:38:47PM -0800, Andrew Straw wrote:
--

Andrew_Straw5 · December 6, 2004, 2:10am

Yes, this was the problem. I applied the patch from

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=279294

and the problem disappeared. Does this imply that on current
Debian/unstable systems matplotlib can only be used with
python-numeric and not with python-numarray?

You can remove atlas3-sse2 (and perhaps atlas3-sse), and it should run OK. Sorry I didn't remember this earlier.

I have to admit that I'm a little disappointed in the speed of this bugfix going into the Debian sources. I'd think that since I submitted a 2 line (1 of which was comments) patch over a month ago, which was copied directly from upstream, this would be about 2 minutes of work for someone... Maybe I should have put it at a higher priority than "Normal".

By the way: I am still interested in my original question. How would
I use a debugger etc. to find the problem myself in such a situation?

If you find out, let me know! Seriously, having a process killed by the kernel because of a signal was difficult for me to debug. (Python is supposed to insulate you from this kind of low-level stuff and generally does a fantastic job.) I had no idea where the FPE signal was coming from or why. I first came across this in the context of a numarray/Intel IPP program. Because IPP is closed source, I didn't go very far initially, and just converted my program to Numeric. Then, I encountered the same thing purely within numarray and knew it was within my grasp. By making a minimal program that exhibited the bug, I determined that the floating point error checking and setting code executed on import of numarray.ieeespecial would cause a floating point error (SIGFPE) in later matrix calls (e.g. numarray.linear_algebra.singular_value_decomposition). I began to suspect the SSE unit because this code ran fine when I compiled numarray with its built-in lapack_lite.

Strangely, running python from within gdb did not terminate or indicate that a SIGFPE was raised. This I'd like to understand a bit better...
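
For readers hitting this kind of crash on a current system, here is a minimal sketch of how to at least learn which signal killed the interpreter and what Python code was running when it arrived. It assumes Python 3.7 or later on a POSIX platform and uses the standard-library faulthandler module (enabled with "python -X faulthandler"), which did not exist when this thread was written. The helper name run_with_faulthandler and the self-inflicted SIGFPE are illustrative stand-ins, not the numarray failure itself.

```python
import signal
import subprocess
import sys


def run_with_faulthandler(code):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (signal_name, stderr_text); signal_name is None if the child
    was not killed by a signal.
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
    )
    # On POSIX, a negative return code means "killed by signal -returncode".
    if result.returncode < 0:
        return signal.Signals(-result.returncode).name, result.stderr
    return None, result.stderr


# Hypothetical stand-in for the real crash: the child sends itself SIGFPE.
crashing_code = "import os, signal\nos.kill(os.getpid(), signal.SIGFPE)"

name, trace = run_with_faulthandler(crashing_code)
print("killed by:", name)
print(trace)  # faulthandler's dump of the Python stack at the time of the signal
```

This only shows where Python was when the signal arrived. As described above, the code that sets up the bad floating point state can run long before the call that actually dies, so the traceback is a starting point rather than a diagnosis.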

Cheers!
Andrew

On Dec 5, 2004, at 9:04 AM, Jochen Voss wrote:

On Wed, Dec 01, 2004 at 06:38:47PM -0800, Andrew Straw wrote:

Steve_Chaplin1 · December 6, 2004, 9:35am

By the way: I am still interested in my original question. How would
I use a debugger etc. to find the problem myself in such a situation?

I should know the answer because I created the glibc patch that fixes
the problem, but it was back in February and I can't remember all the
details.
It started when I thought, why should I run my AthlonXP in '386 emulator
mode' when I can use 'gcc -march=athlon-xp' and actually benefit from
the extra instructions my processor supports. This worked fine until I
compiled numarray and it failed its own tests with a floating-point
exception. But if I used the default gcc settings it worked OK. I filed
a numarray bug report (which I can no longer locate, perhaps they get
deleted after a certain date), they looked at it and said it was
probably a gcc bug. I filed a gcc bug report, and they closed it saying
it was not a gcc bug.
Then I thought it might be a bug with the way kernel handles FP
exceptions and started looking through the kernel sources, but did not
make much progress. So I went back to the numarray source code and tried
to narrow down where the problem was occurring.

Now to answer your question:
Consider you are on a TV game show where you have to guess a number x in
the range 1 to y and are told 'higher', 'lower' or 'correct' after each
turn. You can use a binary search and always guess the mid point of the
range - you are either correct or eliminate half of the possibilities
each turn, so in at most ceil(log2(y)) turns you locate the correct
number.
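
The guessing game can be sketched in a few lines of Python. This is only an illustration; the function name guess_number is made up for the example. Counting the final guess that earns the "correct", the worst case is floor(log2(y)) + 1 turns, which is within one of the figure quoted above.

```python
def guess_number(secret, y):
    """Find `secret` in the range 1..y by always guessing the midpoint.

    Returns (guess, turns), where turns counts every guess made,
    including the final correct one.
    """
    low, high = 1, y
    turns = 0
    while True:
        turns += 1
        guess = (low + high) // 2
        if guess == secret:
            return guess, turns      # host says "correct"
        if guess < secret:
            low = guess + 1          # host says "higher"
        else:
            high = guess - 1         # host says "lower"


y = 100
worst = max(guess_number(secret, y)[1] for secret in range(1, y + 1))
# int.bit_length() gives floor(log2(y)) + 1 without floating point rounding.
print(worst, y.bit_length())
```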

You can use a similar kind of binary search to locate bugs in software.
You know the bug occurs on some line x of the source code with y lines.
Use gdb and insert breakpoints in the code (I think I just inserted
printf() statements instead of using gdb) and see if the error occurs
before or after the breakpoint, move the breakpoint and try again. The
problem is that source code is rarely a linear list of statements in one
file that are executed in order, but a set of procedures/functions in
many files where the execution order can vary. You can start at the
main() function, split it in half and insert a breakpoint (or printf()), run
it and see in which half the error occurs, repeat the process working
your way down into other functions until you pinpoint the error.
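
The same idea can be automated when the failing program is a straight sequence of steps. The following is a rough sketch under several assumptions: the steps are independent lines of Python, the full program fails, and once a prefix of the steps fails every longer prefix fails too. Each trial runs in a fresh child interpreter so that a fatal signal kills only the child. The names prefix_crashes and first_crashing_step, and the example steps, are hypothetical; POSIX is assumed for the SIGFPE stand-in.

```python
import subprocess
import sys


def prefix_crashes(steps, k):
    """Run the first k steps in a fresh interpreter; True if it exits abnormally."""
    program = "\n".join(steps[:k])
    result = subprocess.run(
        [sys.executable, "-c", program],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0


def first_crashing_step(steps):
    """Binary search for the index of the first step that makes the run fail."""
    # Invariant: the prefix of length `low` passes, the prefix of length `high` fails.
    low, high = 0, len(steps)
    while high - low > 1:
        mid = (low + high) // 2
        if prefix_crashes(steps, mid):
            high = mid
        else:
            low = mid
    return high - 1


# Hypothetical program; the os.kill line stands in for the call that crashes.
steps = [
    "import os, signal",
    "x = 1.0",
    "y = x * 2",
    "os.kill(os.getpid(), signal.SIGFPE)",
    "z = y + 1",
]
print(first_crashing_step(steps))
```

Note that this finds the step where the crash shows up, which is not necessarily the step that caused it. For real code with branching control flow, the manual approach described above is still needed.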

Hope that makes sense. You could now reinstall the old glibc, forget
that you know that glibc is the problem, and start again to locate the
bug. It will be useful practice for the next bug that comes along!

Steve

Jochen_Voss4 · December 6, 2004, 6:42pm

Hello Andrew,

I have to admit that I'm a little disappointed in the speed of this
bugfix going into the Debian sources. I'd think that since I submitted
a 2 line (1 of which was comments) patch over a month ago, which was
copied directly from upstream, this would be about 2 minutes of work
for someone...

Yes, this can be a pain. I think the first thing to do
is to add more information to the bug report log. I guess
the Debian libc maintainers are short of time and have
trouble seeing easily whether the bug is actually a bug and the
fix is actually a fix.

I will try to add more information to the bug report log.
Maybe this will help get the patch applied.

All the best,
Jochen

On Sun, Dec 05, 2004 at 06:10:47PM -0800, Andrew Straw wrote:
--

Jochen_Voss4 · December 6, 2004, 6:53pm

Hello Andrew,

On Mon, Dec 06, 2004 at 06:42:03PM +0000, Jochen Voss wrote:

I will try to add more information to the bug report log.
Maybe this will help get the patch applied.

It turns out that I do not understand enough of this to
produce an illustrative example. Do you have a minimal
C program which terminates with SIGFPE because of this bug
where it shouldn't?

All the best,
Jochen
--

_Perry_Greenfield · December 6, 2004, 8:34pm

Yes, this was the problem. I applied the patch from

I really appreciate Andrew's diagnosing the original problem and particularly in recognizing it as a possibility here. This is a nasty kind of bug to figure out.

By the way: I am still interested in my original question. How would
I use a debugger etc. to find the problem myself in such a situation?

If you find out, let me know! Seriously, having a process killed by the kernel

Us too!

Perry

On Dec 5, 2004, at 9:10 PM, Andrew Straw wrote:

On Dec 5, 2004, at 9:04 AM, Jochen Voss wrote:

On Wed, Dec 01, 2004 at 06:38:47PM -0800, Andrew Straw wrote:

Steve_Chaplin1 · December 7, 2004, 2:52am

The original bug was reported to numarray developers via the SourceForge
bug tracking system back in February, the glibc patch was also applied
in February. From Numarray 1.0 onwards a 'Special Note' has been
included in the file numarray/Doc/Install.txt referencing the problem.

I believe the SourceForge bug report was the one
870660 Numarray: CFLAGS build problem
yet for some reason I can't locate it anymore. Perhaps that's one of the
reasons that the problem keeps getting rediscovered.

This is the glibc bug report
http://sources.redhat.com/bugzilla/show_bug.cgi?id=10

Steve

On Mon, 2004-12-06 at 15:34 -0500, Perry Greenfield wrote:

I really appreciate Andrew's diagnosing the original problem and
particularly in recognizing it as a possibility here. This is a nasty
kind of bug to figure out.

Andrew_Straw5 · December 7, 2004, 7:07am

I really appreciate Andrew's diagnosing the original problem and
particularly in recognizing it as a possibility here. This is a nasty
kind of bug to figure out.

The original bug was reported to numarray developers

Probably by the too-modest Steve Chaplin, I suspect. I forgot in my previous email that a significant component of my late-phase debugging consisted of emailing the numarray list, and getting an email from Steven Chaplin, who had independently diagnosed the problem. He had already gone much further than I -- he's the one who submitted the bug report and patch to the glibc itself:

This is the glibc bug report
http://sources.redhat.com/bugzilla/show_bug.cgi?id=10

Jochen, that bug report contains a C program which replicates the bug. Perhaps you could send that test program to the Debian bug tracking system to spur patching? (There is an additional comment on the glibc bugzilla page saying "The test program isn't really testing what it is supposed to (the SSE status is never touched) but the SSE control change is indeed wrong." You may want to address this first if you're up for this kind of low-level fun.)

To summarize, we owe a big thanks to Steve Chaplin. A heartfelt thanks, Steve!

Cheers!
Andrew

On Dec 6, 2004, at 6:52 PM, Steve Chaplin wrote:

On Mon, 2004-12-06 at 15:34 -0500, Perry Greenfield wrote:

Todd_Miller · December 7, 2004, 11:28am

Here's the "numarray bugs tracker" link for this report:

http://sourceforge.net/tracker/index.php?func=detail&aid=870660&group_id=1369&atid=450446

My guess is that you were looking in the "numpy bugs tracker" where the
bug was originally filed but which is supposed to be for Numeric:

Numarray bugs which are filed in the numpy bugs tracker are moved to the
numarray bugs tracker. They're both on the same SF project, but the
numarray tracker is more hidden. I'm sorry this is confusing.

Regards,
Todd

On Tue, 2004-12-07 at 10:52 +0800, Steve Chaplin wrote:


_Perry_Greenfield · December 7, 2004, 4:27pm

Sorry about that. I should have also thanked Steve for doing the hard part.

On Dec 7, 2004, at 2:07 AM, Andrew Straw wrote:

On Dec 6, 2004, at 6:52 PM, Steve Chaplin wrote:

On Mon, 2004-12-06 at 15:34 -0500, Perry Greenfield wrote:
