Nose tests: Font mismatch

I've been only skimming the surface of the discussion about the new test framework up until now.

Just got around to trying it, and every comparison failed because it was selecting a different font than that used in the baseline images. (My matplotlibrc customizes the fonts).

It seems we should probably force "font.family" to "Bitstream Vera Sans" when running the tests. Adding "rcParams['font.family'] = 'Bitstream Vera Sans'" to the "test" function seems to do the trick, but I'll let Andrew make the final call about whether that's the right change. Perhaps we should (as with the documentation build) provide a stock matplotlibrc specifically for testing, since there will be other things like this? Of course, all of these options cause matplotlib.test() to have rcParams side-effects. Probably not worth addressing now, but perhaps worth noting.
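
In other words, roughly this (a sketch only -- the real test() function obviously does more than shown here):

    import matplotlib

    def test():
        # Force the font that the baseline images were generated with, so a
        # user's customized matplotlibrc can't cause spurious mismatches.
        matplotlib.rcParams['font.family'] = 'Bitstream Vera Sans'
        import nose
        return nose.run(argv=['nosetests', 'matplotlib.tests'])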

I am also still getting 6 image comparison failures due to hinting differences (I've attached one of the diffs as an example). Since I haven't been following closely, what's the status on that? Should we be seeing these as failures? What type of hinting are the baseline images produced with?

Mike

failed-diff-single_date.png

I've been only skimming the surface of the discussion about the new test
framework up until now.

Just got around to trying it, and every comparison failed because it was
selecting a different font than that used in the baseline images. (My
matplotlibrc customizes the fonts).

It seems we should probably force "font.family" to "Bitstream Vera Sans"
when running the tests. Adding "rcParams['font.family'] = 'Bitstream Vera
Sans'" to the "test" function seems to do the trick, but I'll let Andrew
make the final call about whether that's the right change. Perhaps we
should (as with the documentation build) provide a stock matplotlibrc
specifically for testing, since there will be other things like this? Of
course, all of these options cause matplotlib.test() to have rcParams
side-effects. Probably not worth addressing now, but perhaps worth noting.

We do have a matplotlibrc file in the "test" dir (the dir that lives
next to setup.py, not lib/matplotlib/tests). This is where we run the
buildbot tests from. It might be a good idea to set the font
explicitly in the test code itself so people can run the tests from
any dir, but I'll leave it to Andrew to weigh in on that.

I am also still getting 6 image comparison failures due to hinting
differences (I've attached one of the diffs as an example). Since I haven't
been following closely, what's the status on that? Should we be seeing
these as failures? What type of hinting are the baseline images produced
with?

We ended up deciding to do identical source builds of freetype to make
sure there were no version differences or freetype configuration
differences. We are using freetype 2.3.5 with the default
configuration. We have seen other versions, eg 2.3.7, even in the
default configuration, give rise to different font renderings, as you
are seeing. This will make testing hard for plain-ol-users, since it
is a lot to ask them to install a special version of freetype for
testing. The alternative, which we discussed before, is to expose the
unhinted option to the frontend, and do all testing with unhinted
text.

JDH

···

On Tue, Sep 8, 2009 at 8:54 AM, Michael Droettboom<mdroe@...31...> wrote:

   

I've been only skimming the surface of the discussion about the new test
framework up until now.

Just got around to trying it, and every comparison failed because it was
selecting a different font than that used in the baseline images. (My
matplotlibrc customizes the fonts).

It seems we should probably force "font.family" to "Bitstream Vera Sans"
when running the tests. Adding "rcParams['font.family'] = 'Bitstream Vera
Sans'" to the "test" function seems to do the trick, but I'll let Andrew
make the final call about whether that's the right change. Perhaps we
should (as with the documentation build) provide a stock matplotlibrc
specifically for testing, since there will be other things like this? Of
course, all of these options cause matplotlib.test() to have rcParams
side-effects. Probably not worth addressing now, but perhaps worth noting.
     

We do have a matplotlibrc file in the "test" dir (the dir that lives
next to setup.py, not lib/matplotlib/tests). This is where we run the
buildbot tests from. It might be a good idea to set the font
explicitly in the test code itself so people can run the tests from
any dir, but I'll leave it to Andrew to weigh in on that.
   

Sure. If we *don't* decide to set it in the code, we should perhaps add a line suggesting to "run the tests from lib/matplotlib/tests" in the documentation. An even better solution might be to forcibly load the matplotlibrc in that directory (even if it's an install directory) when the tests are run.

   

I am also still getting 6 image comparison failures due to hinting
differences (I've attached one of the diffs as an example). Since I haven't
been following closely, what's the status on that? Should we be seeing
these as failures? What type of hinting are the baseline images produced
with?
     

We ended up deciding to do identical source builds of freetype to make
sure there were no version differences or freetype configuration
differences. We are using freetype 2.3.5 with the default
configuration. We have seen other versions, eg 2.3.7, even in the
default configuration, give rise to different font renderings, as you
are seeing. This will make testing hard for plain-ol-users, since it
is a lot to ask them to install a special version of freetype for
testing. The alternative, which we discussed before, is to expose the
unhinted option to the frontend, and do all testing with unhinted
text.
   

I just committed a change to add a "text.hinting" rcParam (which is currently only followed by the Agg backend, though it might make sense for Cairo and macosx to also obey it). This param is then forcibly set to False when the tests are run.
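
Roughly, what the backend does with it is this (a sketch of the shape of it, assuming the LOAD_* constants that ft2font exposes; the actual code may differ):

    from matplotlib import rcParams
    from matplotlib.ft2font import LOAD_FORCE_AUTOHINT, LOAD_NO_HINTING

    def get_hinting_flag():
        # Map the boolean rcParam onto a FreeType load flag; the test
        # runner sets text.hinting to False so glyph outlines are scaled
        # without being snapped to the pixel grid.
        if rcParams['text.hinting']:
            return LOAD_FORCE_AUTOHINT
        return LOAD_NO_HINTING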

Doing so, my results are even *less* in agreement with the baseline, but the real question is whether my results are in agreement with those on the buildbot machines with this change to forcibly turn hinting off. I should know pretty quickly when the buildbots start complaining in a few minutes and I can look at the results ;-)

Hopefully we can find a way for Joe Developer to run these tests without a custom build of freetype. FWIW, I'm using the freetype 2.3.9 packaged with FC11.

Cheers,
Mike

···

On 09/08/2009 10:24 AM, John Hunter wrote:

On Tue, Sep 8, 2009 at 8:54 AM, Michael Droettboom<mdroe@...31...> wrote:

Michael Droettboom wrote:

  

I've been only skimming the surface of the discussion about the new test
framework up until now.

Just got around to trying it, and every comparison failed because it was
selecting a different font than that used in the baseline images. (My
matplotlibrc customizes the fonts).

It seems we should probably force "font.family" to "Bitstream Vera Sans"
when running the tests. Adding "rcParams['font.family'] = 'Bitstream Vera
Sans'" to the "test" function seems to do the trick, but I'll let Andrew
make the final call about whether that's the right change. Perhaps we
should (as with the documentation build) provide a stock matplotlibrc
specifically for testing, since there will be other things like this? Of
course, all of these options cause matplotlib.test() to have rcParams
side-effects. Probably not worth addressing now, but perhaps worth noting.
     

We do have a matplotlibrc file in the "test" dir (the dir that lives
next to setup.py, not lib/matplotlib/tests). This is where we run the
buildbot tests from. It might be a good idea to set the font
explicitly in the test code itself so people can run the tests from
any dir, but I'll leave it to Andrew to weigh in on that.
   

Sure. If we *don't* decide to set it in the code, we should perhaps add a line suggesting to "run the tests from lib/matplotlib/tests" in the documentation. An even better solution might be to forcibly load the matplotlibrc in that directory (even if it's an install directory) when the tests are run.
  

While the default test usage should probably set as much as possible to ensure things are identical, we also want to be able to test other code paths, so I think I'll add some kind of kwarg to matplotlib.test() to handle non-testing-default rcParams. I think setting lots of things, including the font, explicitly in the default case is a good idea.

Question for the rcParams experts: Can we save a copy of it so that we can restore its state after matplotlib.test() is done? (It's just a dictionary, right?)
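
If it really is just a dict, something like this inside matplotlib.test() ought to do it (an untested sketch; the extra_rc argument is the kind of kwarg I mean above):

    import matplotlib

    def test(extra_rc=None):
        # Snapshot the user's settings; rcParams behaves like a dict, so
        # a shallow copy is enough for simple values like font.family.
        saved = dict(matplotlib.rcParams)
        try:
            # Testing defaults first, then any caller overrides on top.
            matplotlib.rcParams['font.family'] = 'Bitstream Vera Sans'
            matplotlib.rcParams['text.hinting'] = False
            if extra_rc is not None:
                matplotlib.rcParams.update(extra_rc)
            # ... run nose here ...
        finally:
            # Restore whatever the user had before the test run.
            matplotlib.rcParams.update(saved)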

   

I am also still getting 6 image comparison failures due to hinting
differences (I've attached one of the diffs as an example). Since I haven't
been following closely, what's the status on that? Should we be seeing
these as failures? What type of hinting are the baseline images produced
with?
     

We ended up deciding to do identical source builds of freetype to make
sure there were no version differences or freetype configuration
differences. We are using freetype 2.3.5 with the default
configuration. We have seen other versions, eg 2.3.7, even in the
default configuration, give rise to different font renderings, as you
are seeing. This will make testing hard for plain-ol-users, since it
is a lot to ask them to install a special version of freetype for
testing. The alternative, which we discussed before, is to expose the
unhinted option to the frontend, and do all testing with unhinted
text.
   

I just committed a change to add a "text.hinting" rcParam (which is currently only followed by the Agg backend, though it might make sense for Cairo and macosx to also obey it). This param is then forcibly set to False when the tests are run.

Doing so, my results are even *less* in agreement with the baseline, but the real question is whether my results are in agreement with those on the buildbot machines with this change to forcibly turn hinting off. I should know pretty quickly when the buildbots start complaining in a few minutes and I can look at the results ;-)
  

I think we compiled freetype with no hinting as a configuration option, so I don't anticipate a failure.

Of course, now I look at the waterfall display, see a bunch of green, think "this looks suspicious" (what does that say about my personality?), click the log of the stdio of the "test" components and see a whole bunch of errors. It seems when I switched over to the matplotlib.test() call for running the tests, I forgot to set the exit code. Let me do that right now. Expect a flood of buildbot errors in the near future...

Hopefully we can find a way for Joe Developer to run these tests without a custom build of freetype.
  

Yes, I completely agree. In the matplotlib.testing.image_comparison() decorator, we right now have only a single image comparison algorithm based on RMS error. Perhaps we could try the perceptual difference code you linked to? Also, maybe we could simply turn font rendering off completely for a majority of the tests? Or maybe the tests should be run with and without text drawn, with much lower error tolerances when there's no text?
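
For reference, the RMS check currently amounts to roughly the following (a simplified sketch, not the exact code in the comparison helper):

    import numpy as np
    import matplotlib.image as mimage

    def rms_error(expected_file, actual_file):
        # Per-pixel root-mean-square difference between two same-sized
        # PNGs, scaled to 0-255 to match the tolerances we pass around.
        expected = mimage.imread(expected_file) * 255.0
        actual = mimage.imread(actual_file) * 255.0
        return np.sqrt(np.mean((expected - actual) ** 2))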

The nice thing about our test infrastructure now is that it's pretty small, lightweight, and flexible. The image comparison stuff is just done in a single decorator function, and the only nose plugin is the "known failure" plugin. We can continue writing tests, which I hope will be mostly independent of improvements to the infrastructure.

···

On 09/08/2009 10:24 AM, John Hunter wrote:

On Tue, Sep 8, 2009 at 8:54 AM, Michael Droettboom<mdroe@...31...> wrote:

While the default test usage should probably set as much as possible to
ensure things are identical, we also want to be able to test other code
paths, so I think I'll add some kind of kwarg to matplotlib.test() to handle
non-testing-default rcParams. I think setting lots of things, including the
font, explicitly in the default case is a good idea.

Question for the rcParams experts: Can we save a copy of it so that we can
restore its state after matplotlib.test() is done? (It's just a dictionary,
right?)

I committed this change.

Yes, I completely agree. In the matplotlib.testing.image_comparison()
decorator, we right now have only a single image comparison algorithm based
on RMS error. Perhaps we could try the perceptual difference code you linked
to? Also, maybe we could simply turn font rendering off completely for a
majority of the tests? Or maybe the tests should be run with and without
text drawn, with much lower error tolerances when there's no text?

Perhaps with hinting turned off this won't be necessary. Ie, maybe we
can get more agreement across a wide range of freetype versions w/o
hinting. Are you planning on committing the unhinted baselines?

JDH

···

On Tue, Sep 8, 2009 at 10:46 AM, Andrew Straw<strawman@...36...> wrote:

Michael Droettboom wrote:

I've been only skimming the surface of the discussion about the new test
framework up until now.

Just got around to trying it, and every comparison failed because it was
selecting a different font than that used in the baseline images. (My
matplotlibrc customizes the fonts).

It seems we should probably force "font.family" to "Bitstream Vera Sans"
when running the tests. Adding "rcParams['font.family'] = 'Bitstream Vera
Sans'" to the "test" function seems to do the trick, but I'll let Andrew
make the final call about whether that's the right change. Perhaps we
should (as with the documentation build) provide a stock matplotlibrc
specifically for testing, since there will be other things like this? Of
course, all of these options cause matplotlib.test() to have rcParams
side-effects. Probably not worth addressing now, but perhaps worth noting.

We do have a matplotlibrc file in the "test" dir (the dir that lives
next to setup.py, not lib/matplotlib/tests). This is where we run the
buildbot tests from. It might be a good idea to set the font
explicitly in the test code itself so people can run the tests from
any dir, but I'll leave it to Andrew to weigh in on that.

Sure. If we *don't* decide to set it in the code, we should perhaps add
a line suggesting to "run the tests from lib/matplotlib/tests" in the
documentation. An even better solution might be to forcibly load the
matplotlibrc in that directory (even if it's an install directory) when
the tests are run.

While the default test usage should probably set as much as possible to
ensure things are identical, we also want to be able to test other code
paths, so I think I'll add some kind of kwarg to matplotlib.test() to
handle non-testing-default rcParams. I think setting lots of things,
including the font, explicitly in the default case is a good idea.

I think the defaults should be used, and any test that needs non-standard
settings should set those rc values explicitly. Perhaps a decorator is needed
that lets you pass the rc values, runs the test, and then calls
rcdefaults on the way out? I haven't been following your progress
closely (sorry about that, I am very grateful for all the work you are
doing.)
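
Something along these lines, maybe (just a sketch; the with_rc name is made up):

    import functools
    import matplotlib

    def with_rc(rc):
        # 'rc' is a dict of rc values, e.g.
        # {'font.family': 'Bitstream Vera Sans'}.  Apply them for one
        # test, then call rcdefaults() on the way out so nothing leaks.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                matplotlib.rcParams.update(rc)
                try:
                    return func(*args, **kwargs)
                finally:
                    matplotlib.rcdefaults()
            return wrapper
        return decorator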

Question for the rcParams experts: Can we save a copy of it so that we
can restore its state after matplotlib.test() is done? (It's just a
dictionary, right?)

rcdefaults() should reset all the values for you.

Darren

···

On Tue, Sep 8, 2009 at 11:46 AM, Andrew Straw<strawman@...36...> wrote:

On 09/08/2009 10:24 AM, John Hunter wrote:

On Tue, Sep 8, 2009 at 8:54 AM, Michael Droettboom<mdroe@...31...> wrote:

John Hunter wrote:

Perhaps with hinting turned off this won't be necessary. Ie, maybe we
can get more agreement across a wide range of freetype versions w/o
hinting. Are you planning on committing the unhinted baselines?

I have a presentation to give tomorrow, so I'd just as soon let you and
Michael fight the flood of red that is about to occur! :-)

But I can step up again later in the week with more time. In the
meantime, why don't I just keep my eye on my email inbox but stay out of
the code and baseline images for the most part?

-Andrew

Michael Droettboom wrote:

Doing so, my results are even *less* in agreement with the baseline, but
the real question is whether my results are in agreement with those on
the buildbot machines with this change to forcibly turn hinting off. I
should know pretty quickly when the buildbots start complaining in a few
minutes and I can look at the results ;-)
  

Yes, even though the waterfall is showing green (for the next 2 minutes
until my buildbot script bugfix gets run), it's pretty clear from the
image failure page that disabling hinting introduced changes to the
generated figure appearance. It will be interesting to see if, after
checking in the newly generated actual images as the new baseline, the
tests start passing on your machine with the newer freetype.

In a footnote to myself, I think the ImageComparisonFailure exception
should tell nose that the test failed, not that there was an error.
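In nose terms that probably just means deriving the exception from AssertionError, since nose reports those as failures rather than errors; e.g., a sketch:

    class ImageComparisonFailure(AssertionError):
        # Subclassing AssertionError makes nose/unittest count this as a
        # test failure ('F') instead of an error ('E').
        pass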

-Andrew

Interesting result. I pulled all of the new "actual" files from the 21 failing tests on the buildbots to my local machine and all of those tests now pass for me. Good. Interestingly, there are still two tests failing on my machine which did not fail on the buildbots, so I can't grab the buildbots' new output. Could this just be a thresholding issue for the tolerance value? I'm a little wary of "polluting" the baseline images with images from my machine which doesn't have our "standard" version of Freetype, so I'll leave those out of SVN for now, but will go ahead and commit the new baseline images from the buildbots. Assuming these two mystery failures are resolved by pulling new images from the buildbots, I think this experiment with turning off hinting is a success.

As an aside, is there an easy way to update the baselines I'm missing? At the moment, I'm copying each result file to the correct folder under tests/baseline_images, but it takes me a while because I don't know the hierarchy by heart and there are 22 failures. I was expecting to just manually verify everything was ok and then "cp *.png" from my scratch tests folder to baseline_images and let SVN take care of which files had actually changed. This is just the naive feedback of a new set of eyes: it's extremely useful and powerful what you've put together here.

Mike

···

On 09/08/2009 12:06 PM, Andrew Straw wrote:

Michael Droettboom wrote:
   

Doing so, my results are even *less* in agreement with the baseline, but
the real question is whether my results are in agreement with those on
the buildbot machines with this change to forcibly turn hinting off. I
should know pretty quickly when the buildbots start complaining in a few
minutes and I can look at the results ;-)

Yes, even though the waterfall is showing green (for the next 2 minutes
until my buildbot script bugfix gets run), it's pretty clear from the
image failure page that disabling hinting introduced changes to the
generated figure appearance. It will be interesting to see if, after
checking in the newly generated actual images as the new baseline, the
tests start passing on your machine with the newer freetype.

In a footnote to myself, I think the ImageComparisonFailure exception
should tell nose that the test failed, not that there was an error.

-Andrew

Michael Droettboom wrote:

Interesting result. I pulled all of the new "actual" files from the 21
failing tests on the buildbots to my local machine and all of those
tests now pass for me. Good. Interestingly, there are still two tests
failing on my machine which did not fail on the buildbots, so I can't
grab the buildbots' new output.

Well, if they're not failing on the buildbots, that means the baseline
in svn can't be too different than what they generate. But it's a good
point that we want the actual output of the buildbots regardless of
whether the test failed.

  Could this just be a thresholding issue
for the tolerance value? I'm a little wary of "polluting" the baseline
images with images from my machine which doesn't have our "standard"
version of Freetype, so I'll leave those out of SVN for now, but will go
ahead and commit the new baseline images from the buildbots.

Looking at the 2 images failing on the buildbots, I'm reasonably sure
they were generated by James Evans when he created the first test
infrastructure. So I say go ahead and check in the actual images
generated by the buildbots. (Or did you recently re-upload those images?)

  Assuming
these two mystery failures are resolved by pulling new images from the
buildbots, I think this experiment with turning off hinting is a success.
  

Yes, I think so, too. I was going to suggest getting on the freetype
email list to ask them about their opinion on what we're doing.

As an aside, is there an easy way to update the baselines I'm missing?
At the moment, I'm copying each result file to the correct folder under
tests/baseline_images, but it takes me a while because I don't know the
hierarchy by heart and there are 22 failures. I was expecting to just
manually verify everything was ok and then "cp *.png" from my scratch
tests folder to baseline_images and let SVN take care of which files had
actually changed.

Unfortunately, there's no easy baseline update yet. John wrote one for
the old test infrastructure, but I ended up dropping that in the
switchover to the simplified infrastructure. The reason was that the
image comparison mechanism, and the directories to which they were
saved, changed, and thus his script would have required a re-working.
Given that I don't consider the current mechanism for this particularly
good, I was hesitant to invest the effort to port over support for a
crappy layout.

(The trouble with the current actual/baseline/diff result gathering
mechanism is that it uses the filesystem as a means for communication
within the nose test running process in addition to communication with
the buildbot process through hard-coded assumptions about paths and
filenames. If the only concern was within nose, we could presumably
re-work some of the old MplNoseTester plugin to handle the new case, but
given the buildbot consideration it gets more difficult to get these
frameworks talking through supported API calls. Thus, although the
hardcoded path and filename stuff is a hack, it will require some
serious nose and buildbot learning to figure out how to do it the
"right" way. So I'm all for sticking with the hack right now, and making
a bit nicer by doing things like having a better directory hierarchy
layout for the actual result images.)

  This is just the naive feedback of a new set of eyes:
it's extremely useful and powerful what you've put together here.
  

Thanks for the feedback.

The goal is that Joe Dev would think it's easy and useful and thus start
using it. Tests should be simple to write and run so that we actually do
that. Like I wrote earlier, by keeping the tests themselves simple and
clean, I hope we can improve the testing infrastructure mostly
independently of changes to the tests themselves.

-Andrew

Interesting result. I pulled all of the new "actual" files from the 21
failing tests on the buildbots to my local machine and all of those tests
now pass for me. Good. Interestingly, there are still two tests failing on
my machine which did not fail on the buildbots, so I can't grab the
buildbots' new output. Could this just be a thresholding issue for the
tolerance value? I'm a little wary of "polluting" the baseline images with
images from my machine which doesn't have our "standard" version of
Freetype, so I'll leave those out of SVN for now, but will go ahead and
commit the new baseline images from the buildbots. Assuming these two
mystery failures are resolved by pulling new images from the buildbots, I
think this experiment with turning off hinting is a success.

Are the two images you are referring to the formatter_ticker_002.png
and polar_wrap_360.png failures? I just committed those from the
actual output on the sage buildbot. But I am curious why you couldn't
pull these down from the buildbot, eg

  http://mpl.code.astraw.com/hardy-py24-amd64-chroot/formatter_ticker_002/actual.png
  http://mpl.code.astraw.com/hardy-py24-amd64-chroot/polar_wrap_360/actual.gif

As an aside, is there an easy way to update the baselines I'm missing? At
the moment, I'm copying each result file to the correct folder under
tests/baseline_images, but it takes me a while because I don't know the
hierarchy by heart and there are 22 failures. I was expecting to just
manually verify everything was ok and then "cp *.png" from my scratch tests
folder to baseline_images and let SVN take care of which files had actually
changed. This is just the naive feedback of a new set of eyes: it's
extremely useful and powerful what you've put together here.

I wrote a script at SciPy when Andrew and I worked on this to
recursively move known good actuals into the baselines directory, with
some yes/no prompting, but it looks like it did not survive the test
code migration, so we may want to develop something to replace it.
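
Something like the following would probably be enough to replace it (a sketch; the two directory names are placeholders for whatever layout we end up with):

    import os
    import shutil

    ACTUAL_DIR = 'result_images'
    BASELINE_DIR = 'lib/matplotlib/tests/baseline_images'

    for dirpath, dirnames, filenames in os.walk(ACTUAL_DIR):
        for fname in filenames:
            if not fname.endswith('.png'):
                continue
            actual = os.path.join(dirpath, fname)
            rel = dirpath[len(ACTUAL_DIR):].lstrip(os.sep)
            baseline = os.path.join(BASELINE_DIR, rel, fname)
            if not os.path.exists(baseline):
                continue
            # Prompt before overwriting each baseline with the new actual.
            answer = raw_input('Accept %s as new baseline? [y/N] ' % actual)
            if answer.lower().startswith('y'):
                shutil.copyfile(actual, baseline)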

JDH

···

On Tue, Sep 8, 2009 at 11:47 AM, Michael Droettboom<mdroe@...31...> wrote:

More information after another build iteration.

The two tests that failed after updating to the unhinted images were subtests of tests that were failing earlier. If a single test function outputs multiple images, image comparison stops after the first mismatched image. So there's nothing peculiar about these tests, it's just that the system wasn't saying they were failing before since they were short-circuited by earlier failures. I wonder if it's possible to run through all the images and batch up all the failures together, so we don't have these "hidden" failures -- might mean fewer iterations with the buildbots down the road.

Good news is this does point to having the font problem licked.

Mike


Michael Droettboom wrote:

More information after another build iteration.

The two tests that failed after updating to the unhinted images were
subtests of tests that were failing earlier. If a single test
function outputs multiple images, image comparison stops after the
first mismatched image. So there's nothing peculiar about these
tests, it's just that the system wasn't saying they were failing
before since they were short-circuited by earlier failures. I wonder
if it's possible to run through all the images and batch up all the
failures together, so we don't have these "hidden" failures -- might
mean fewer iterations with the buildbots down the road.

Ahh, good point. I can collect the failures in the image_comparison()
decorator and raise one failure that describes all the failed images.
Right now the loop that iterates over the images raises an exception on
the first failure, which clearly breaks out of the loop. I've added it to
the nascent TODO list, which I'll check into the repo next to
_buildbot_test.py.
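
In other words, roughly this inside the decorator (a sketch only; the exact helper names and module paths may differ, and it assumes the comparison helper returns None on a match):

    from matplotlib.testing.compare import compare_images
    from matplotlib.testing.exceptions import ImageComparisonFailure

    def compare_all(image_pairs, tol):
        # Compare every (expected, actual) pair, collecting the
        # mismatches instead of raising on the first one, then report
        # them all in a single failure.
        failures = []
        for expected, actual in image_pairs:
            err = compare_images(expected, actual, tol)
            if err is not None:
                failures.append(err)
        if failures:
            raise ImageComparisonFailure('\n'.join(str(f) for f in failures))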

Good news is this does point to having the font problem licked.

Very good news indeed.

John Hunter wrote:

I wrote a script at SciPy when Andrew and I worked on this to
recursively move known good actuals into the baselines directory, with
some yes/no prompting, but it looks like it did not survive the test
code migration, so we may want to develop something to replace it.

Yes, we do. But I think we should hold off a bit until I get a slightly
better output image hierarchy established. (See my other post for more
detailed thoughts -- our emails crossed in the ether.)

-Andrew

Should I hold off on committing the other formatter baselines until
you have made these changes so you can test, or do you want me to go
ahead and commit the rest of these now?

···

On Tue, Sep 8, 2009 at 12:34 PM, Andrew Straw<strawman@...36...> wrote:

Michael Droettboom wrote:

More information after another build iteration.

The two tests that failed after updating to the unhinted images were
subtests of tests that were failing earlier. If a single test
function outputs multiple images, image comparison stops after the
first mismatched image. So there's nothing peculiar about these
tests, it's just that the system wasn't saying they were failing
before since they were short-circuited by earlier failures. I wonder
if it's possible to run through all the images and batch up all the
failures together, so we don't have these "hidden" failures -- might
mean fewer iterations with the buildbots down the road.

Ahh, good point. I can collect the failures in the image_comparison()
decorator and raise one failure that describes all the failed images.
Right now the loop that iterates over the images raises an exception on
the first failure, which clearly breaks out of the loop. I've added it to
the nascent TODO list, which I'll check into the repo next to
_buildbot_test.py.

John Hunter wrote:

···

On Tue, Sep 8, 2009 at 12:34 PM, Andrew Straw<strawman@...36...> wrote:
  

Michael Droettboom wrote:
    

More information after another build iteration.

The two tests that failed after updating to the unhinted images were
subtests of tests that were failing earlier. If a single test
function outputs multiple images, image comparison stops after the
first mismatched image. So there's nothing peculiar about these
tests, it's just that the system wasn't saying they were failing
before since they were short-circuited by earlier failures. I wonder
if it's possible to run through all the images and batch up all the
failures together, so we don't have these "hidden" failures -- might
mean fewer iterations with the buildbots down the road.
      

Ahh, good point. I can collect the failures in the image_comparison()
decorator and raise one failure that describes all the failed images.
Right now the loop that iterates over the images raises an exception on
the first failure, which clearly breaks out of the loop. I've added it to
the nascent TODO list, which I'll check into the repo next to
_buildbot_test.py.
    
Should I hold off on committing the other formatter baselines until
you have made these changes so you can test, or do you want me to go
ahead and commit the rest of these now?
  

Go ahead -- please don't wait for me. I have many means of causing image
comparison failures when the time comes. :-)

-Andrew