[PATCH] don't print undefined glyphs in ps backend

Hi,

Currently, the PS backend does not work well with some fonts. For
instance, it displays a dotted square instead of whitespace with
Arial, and some strange dots instead of whitespace with Times New
Roman. This patch fixes it by omitting glyphs named ".notdef" from PS
output.

1.patch (648 Bytes)

Sorry, I was too hasty. The patch is wrong, here is the real reason:

FT2Font.get_charmap() returns a mapping from glyph index to character code.
This looks like a very bad design decision to me, because several character
codes can correspond to one glyph. For example, in Times New Roman, both 0x20
(space) and 0xA0 (nbsp) are mapped to glyph index 3. Of course, the first one
gets lost in get_charmap().

I think, get_charmap should be fixed to return mapping from character codes to
glyph indices. Alternatively, get_charmap() could be left as it is, and
get_rcharmap() added.

I'm willing to implement either one. Which do you prefer ?
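
The loss described above can be reproduced with plain dictionaries. This is only a minimal sketch, assuming a toy font in which the space and the no-break space share glyph index 3. The glyph numbers and the helper names glyph_to_char and char_to_glyph are made up for illustration, and nothing here calls FT2Font.

```python
# Toy (character code, glyph index) pairs, in the order a charmap might
# list them. The values are illustrative, not read from a real font.
PAIRS = [(0x20, 3), (0x41, 36), (0xA0, 3)]


def glyph_to_char(pairs):
    """Build the glyph index -> character code mapping (current direction)."""
    mapping = {}
    for code, index in pairs:
        mapping[index] = code  # a later code silently overwrites an earlier one
    return mapping


def char_to_glyph(pairs):
    """Build the character code -> glyph index mapping (proposed direction)."""
    return {code: index for code, index in pairs}


old = glyph_to_char(PAIRS)
new = char_to_glyph(PAIRS)

# Reversing the glyph -> char mapping cannot bring the lost space back.
recovered = {code: index for index, code in old.items()}

print(sorted(hex(c) for c in recovered))  # the space (0x20) is gone
print(sorted(hex(c) for c in new))        # all three codes are present
```

With the character code as the key, both codes keep their entry and simply share the same value.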

FT2Font.get_charmap() returns a mapping from glyph index to character code.
This looks like a very bad design decision to me, because several character
codes can correspond to one glyph. For example, in Times New Roman, both 0x20
(space) and 0xA0 (nbsp) are mapped to glyph index 3. Of course, the first one
gets lost in get_charmap().

I think, get_charmap should be fixed to return mapping from character codes to
glyph indices. Alternatively, get_charmap() could be left as it is, and
get_rcharmap() added.

I agree with you. I've already posted something about this issue some
time ago: Re: [Phpwiki-talk] Issues with Wiki Dump and Garbage in page_data | PhpWiki

I'm willing to implement either one. Which do you prefer ?

I think we should prefer the first alternative: I've made a quick grep
through matplotlib's code and I've observed that each time get_charmap
is called, the returned dict is never used as is, but immediately
reversed.

-- Nicolas Grilly

···

On 2/14/07, Evgeniy Stepanov <eugeni.stepanov@...149...> wrote:

I also prefer the first way. Here is the patch. Please re-check at least the
changes to mathtext.py, I could miss something. mathtext_demo.py still works,
but it obviously does not test all the changes.

1.patch (7.43 KB)

···

On Wednesday 14 February 2007 21:53, Nicolas Grilly wrote:

On 2/14/07, Evgeniy Stepanov <eugeni.stepanov@...149...> wrote:
> FT2Font.get_charmap() returns a mapping from glyph index to character
> code. This looks like a very bad design decision to me, because several
> character codes can correspond to one glyph. For example, in Times New
> Roman, both 0x20 (space) and 0xA0 (nbsp) are mapped to glyph index 3. Of
> course, the first one gets lost in get_charmap().
>
> I think, get_charmap should be fixed to return mapping from character
> codes to glyph indices. Alternatively, get_charmap() could be left as it
> is, and get_rcharmap() added.

I agree with you. I've already posted something about this issue some
time ago: Re: [Phpwiki-talk] Issues with Wiki Dump and Garbage in page_data | PhpWiki

> I'm willing to implement either one. Which do you prefer ?

I think we should prefer the first alternative: I've made a quick grep
through matplotlib's code and I've observed that each time get_charmap
is called, the returned dict is never used as is, but immediately
reversed.

Thanks for looking into this -- last time Nicolas brought this up back
in November, Paul argued that reversing the dictionary "violated the
principle of least surprise" but clearly you two disagree. If Paul is
still monitoring this, he can weigh in again if he still objects to the
reversal. You should try running examples/backend_driver and looking
at as many of the PS and PNG outputs as you can to make sure the text
looks right, and then send on a final patch if any revisions are
needed and one of us can see to it that it gets incorporated.

JDH

···

On 2/14/07, Evgeniy Stepanov <eugeni.stepanov@...149...> wrote:

I also prefer the first way. Here is the patch. Please re-check at least the
changes to mathtext.py, I could miss something. mathtext_demo.py still works,
but it obviously does not test all the changes.

John,

I still feel this way, but maybe I should change my tune and let the
changes go in.

-- Paul

···

On 2/14/07, John Hunter <jdh2358@...149...> wrote:

On 2/14/07, Evgeniy Stepanov <eugeni.stepanov@...149...> wrote:

> I also prefer the first way. Here is the patch. Please re-check at least the
> changes to mathtext.py, I could miss something. mathtext_demo.py still works,
> but it obviously does not test all the changes.

Thanks for looking into this -- last time Nicolas brought this up back
in November, Paul argued that reversing the dictionary "violated the
principle of least surprise" but clearly you two disagree. If Paul is
still monitoring this, he can weigh in again if he still objects to the
reversal. You should try running examples/backend_driver and looking
at as many of the PS and PNG outputs as you can to make sure the text
looks right, and then send on a final patch if any revisions are
needed and one of us can see to it that it gets incorporated.

JDH

_______________________________________________
Matplotlib-devel mailing list
Matplotlib-devel@lists.sourceforge.net

I still feel this way, but maybe I should change my tune and let the
changes go in.

What do you think about the comments made earlier in this thread:

FT2Font.get_charmap() returns a mapping from glyph index to character code.
This looks like a very bad design decision to me, because several character
codes can correspond to one glyph. For example, in Times New Roman, both 0x20
(space) and 0xA0 (nbsp) are mapped to glyph index 3. Of course, the first one
gets lost in get_charmap().

I don't remember why we did it this way originally, or if it was you or I
who did it, but if it is correct that sometimes many codes point to
one glyph index, but that each glyph index must
point to a single character code (the latter must be correct, right?)
then reversing it seems to be the right course. But it's been a long
time since I delved into freetype internals ...

JDH

···

On 2/14/07, Paul Barrett <pebarrett@...149...> wrote:

> I still feel this way, but maybe I should change my tune and let the
> changes go in.

What do you think about the comments made earlier in this thread:

My first reply:

I suggest that this patch not be applied, since this was the intended
behavior when the font manager was implemented. The standard behavior
for indicating a missing character is to print a square. In addition,
if a space is printed, how will you know when the formatting is
correct or not. The unanticipated space could mean font is missing
that character, or the layout manager has a bug?

and second reply:

If my memory serves me correctly - or if the implementation has
changed over the past few years - get_charmap() is a wrapper on the
FreeType method. FreeType had no reverse mapping and creating one may
have caused problems later.

I prefer the second alternative. If FreeType now has a reverse
mapping, then by all means create a wrapper for it. If not, then you
will need to take some care that get_rcharmap is reasonably future
proof, so that it does not cause maintenance problems later on.

> FT2Font.get_charmap() returns a mapping from glyph index to character code.
> This looks like a very bad design decision to me, because several character
> codes can correspond to one glyph. For example, in Times New Roman, both 0x20
> (space) and 0xA0 (nbsp) are mapped to glyph index 3. Of course, the first one
> gets lost in get_charmap().

I don't remember why we did it this way originally, or if it was you or I
who did it, but if it is correct that sometimes many codes point to
one glyph index, but that each glyph index must
point to a single character code (the latter must be correct, right?)
then reversing it seems to be the right course. But it's been a long
time since I delved into freetype internals ...

I think I did it. At the time the reverse mapping seemed the best
approach, since this ultimately is what the code demanded. (I guess
my memory has failed me!) We also did not have any examples of the
many to one mapping. As you state, this has now changed and the
latter must be correct. This now explains the FreeType implementation.

-- Paul

···

On 2/14/07, John Hunter <jdh2358@...149...> wrote:

On 2/14/07, Paul Barrett <pebarrett@...149...> wrote:

I don't remember why we did it this way originally, or if it was you or I
who did it, but if it is correct that sometimes many codes point to
one glyph index, but that each glyph index must
point to a single character code (the latter must be correct, right?)
then reversing it seems to be the right course. But it's been a long
time since I delved into freetype internals ...

FreeType 2 documentation states very clearly that character codes are
mapped to glyph indices, not the opposite.

I think the mapping from glyph indices to character codes is useless,
and I've not seen any use case of that in matplotlib.

The page FreeType Tutorial / I says:

- "A face object contains one or more tables, called charmaps, that
are used to convert character codes to glyph indices."

- "To convert a Unicode character code to a font glyph index, we use
FT_Get_Char_Index, as in "glyph_index = FT_Get_Char_Index( face,
charcode );". This will look the glyph index corresponding to the
given charcode in the charmap that is currently selected for the
face."

The page http://www.freetype.org/freetype2/docs/reference/ft2-base_interface.html
says:

- "FT_Get_First_Char: This function is used to return the first
character code in the current charmap of a given face. It also returns
the corresponding glyph index."

- "FT_Get_Next_Char: This function is used to return the next
character code in the current charmap of a given face following the
value 'char_code', as well as the corresponding glyph index."
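
The iteration these two functions support can be sketched in Python. This is a simulation, not a FreeType binding: ToyFace, get_first_char, get_next_char and build_charmap are hypothetical names, and the charmap values are invented. It assumes the FreeType convention that a returned glyph index of 0 marks the end of the charmap. In C the glyph index comes back through an output pointer, whereas here it is returned as the second element of a tuple.

```python
import bisect


class ToyFace:
    """Stand-in for a face with one selected charmap (not a FreeType binding)."""

    def __init__(self, charmap):
        self._charmap = dict(charmap)
        self._codes = sorted(self._charmap)

    def get_first_char(self):
        """Return (first character code, glyph index), or (0, 0) if empty."""
        if not self._codes:
            return 0, 0
        code = self._codes[0]
        return code, self._charmap[code]

    def get_next_char(self, char_code):
        """Return the (code, glyph index) following char_code, or (0, 0) at the end."""
        pos = bisect.bisect_right(self._codes, char_code)
        if pos == len(self._codes):
            return 0, 0
        code = self._codes[pos]
        return code, self._charmap[code]


def build_charmap(face):
    """Walk the charmap and collect character code -> glyph index."""
    charmap = {}
    code, index = face.get_first_char()
    while index != 0:  # glyph index 0 signals that there are no more codes
        charmap[code] = index
        code, index = face.get_next_char(code)
    return charmap


face = ToyFace({0x20: 3, 0x41: 36, 0xA0: 3})
charmap = build_charmap(face)
print(charmap)
```

Because the walk hands out one character code at a time, keying the result by character code keeps every entry.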

Thanks,

-- Nicolas

My first reply:

I suggest that this patch not be applied, since this was the intended
behavior when the font manager was implemented. The standard behavior
for indicating a missing character is to print a square. In addition,
if a space is printed, how will you know when the formatting is
correct or not. The unanticipated space could mean font is missing
that character, or the layout manager has a bug?

I agree with that. The character name .notdef exists for that purpose
and should be represented by a square or a question mark or something
else, depending on the viewing application. Therefore, the character
name .notdef should not be ignored or replaced by a space.
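
One way to keep the missing-character marker visible is to fall back to glyph index 0 whenever a character code is absent from the charmap. FreeType's FT_Get_Char_Index returns 0 for an undefined character code, and by convention that index holds the "missing glyph". A small sketch, assuming the charmap is a character code -> glyph index dict. The helper glyph_indices and the toy values are hypothetical and not taken from matplotlib.

```python
MISSING_GLYPH = 0  # glyph index 0 is conventionally the "missing glyph" (.notdef)


def glyph_indices(text, charmap):
    """Map each character to a glyph index, keeping unknown characters visible."""
    return [charmap.get(ord(ch), MISSING_GLYPH) for ch in text]


charmap = {0x20: 3, 0x41: 36, 0xA0: 3}  # toy values, not from a real font
indices = glyph_indices("A A\u00a0Z", charmap)
print(indices)  # 'Z' is not in the toy charmap, so it maps to MISSING_GLYPH
```

A real space and a missing character then stay distinguishable in the output: the former maps to the font's space glyph, the latter to .notdef.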

and second reply:

If my memory serves me correctly - or if the implementation has
changed over the past few years - get_charmap() is a wrapper on the
FreeType method. FreeType had no reverse mapping and creating one may
have caused problems later.

Matplotlib's method get_charmap is a wrapper around FreeType's
functions FT_Get_First_Char and FT_Get_Next_Char. These functions are
designed to map character codes to glyph indices, nothing else. But
our method get_charmap does the opposite, which seems strange.

I prefer the second alternative. If FreeType now has a reverse
mapping, then by all means create a wrapper for it. If not, then you
will need to take some care that get_rcharmap is reasonably future
proof, so that it does not cause maintenance problems later on.

To my knowledge, there is no "reverse mapping" in FreeType. There is
only one mapping: character code -> glyph index.

John wrote:

> I don't remember why we did it this way originally, or if it was you or I
> who did it, but if it is correct that sometimes many codes point to
> one glyph index, but that each glyph index must
> point to a single character code (the latter must be correct, right?)
> then reversing it seems to be the right course. But it's been a long
> time since I delved into freetype internals ...

1 character code maps to exactly 1 glyph index. I think the opposite
assumption, i.e. 1 glyph index maps to exactly 1 character code, is
incorrect.

I think I did it. At the time the reverse mapping seemed the best
approach, since this ultimately is what the code demanded. (I guess
my memory has failed me!) We also did not have any examples of the
many to one mapping. As you state, this has now changed and the
latter must be correct. This now explains the FreeType implementation.

Conclusion:
I think we should change the following line in ft2font.cpp from:
    charmap[Py::Int((int) index)] = Py::Long((long) code);
to:
    charmap[Py::Long((long) code)] = Py::Int((int) index);
as proposed by Evgeniy.

This will simplify the few lines of code using it in .py files.
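
To illustrate the simplification on the Python side: with a character code -> glyph index dict, a caller can look up a glyph directly, and the rare caller that needs the other direction can still build it without losing entries. This is a sketch with a toy dict standing in for the return value of get_charmap(). The helper codes_by_glyph is hypothetical, and no matplotlib code is quoted here.

```python
from collections import defaultdict

# Toy return value in the proposed direction: character code -> glyph index.
charmap = {0x20: 3, 0x41: 36, 0xA0: 3}

# Direct lookup, with no reversal step in the caller.
glyph_for_a = charmap[ord("A")]


def codes_by_glyph(charmap):
    """Group character codes by glyph index without dropping any of them."""
    grouped = defaultdict(list)
    for code, index in sorted(charmap.items()):
        grouped[index].append(code)
    return dict(grouped)


reverse = codes_by_glyph(charmap)
print(glyph_for_a)
print(reverse)
```

The grouped form makes the many-to-one relation explicit: glyph index 3 lists both 0x20 and 0xA0.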

-- Nicolas

···

On 2/14/07, Paul Barrett <pebarrett@...149...> wrote:

I agree with what you said, I've also never heard of glyph index -> character
code mapping. I've checked all ps and png outputs from examples/, everything
seems ok.

···

On Thursday 15 February 2007 21:41, Nicolas Grilly wrote:

On 2/14/07, Paul Barrett <pebarrett@...149...> wrote:
> My first reply:
>
> I suggest that this patch not be applied, since this was the intended
> behavior when the font manager was implemented. The standard behavior
> for indicating a missing character is to print a square. In addition,
> if a space is printed, how will you know when the formatting is
> correct or not. The unanticipated space could mean font is missing
> that character, or the layout manager has a bug?

I agree with that. The character name .notdef exists for that purpose
and should be represented by a square or a question mark or something
else, depending on the viewing application. Therefore, the character
name .notdef should not be ignored or replaced by a space.

> and second reply:
>
> If my memory serves me correctly - or if the implementation has
> changed over the past few years - get_charmap() is a wrapper on the
> FreeType method. FreeType had no reverse mapping and creating one may
> have caused problems later.

Matplotlib's method get_charmap is a wrapper around FreeType's
functions FT_Get_First_Char and FT_Get_Next_Char. These functions are
designed to map character codes to glyph indices, nothing else. But
our method get_charmap does the opposite, which seems strange.

> I prefer the second alternative. If FreeType now has a reverse
> mapping, then by all means create a wrapper for it. If not, then you
> will need to take some care that get_rcharmap is reasonably future
> proof, so that it does not cause maintenance problems later on.

To my knowledge, there is no "reverse mapping" in FreeType. There is
only one mapping: character code -> glyph index.

John wrote:
> > I don't remember why we did it this way originally, or if it was you or I
> > who did it, but if it is correct that sometimes many codes point to
> > one glyph index, but that each glyph index must
> > point to a single character code (the latter must be correct, right?)
> > then reversing it seems to be the right course. But it's been a long
> > time since I delved into freetype internals ...

1 character code maps to exactly 1 glyph index. I think the opposite
assumption, i.e. 1 glyph index maps to exactly 1 character code, is
incorrect.

> I think I did it. At the time the reverse mapping seemed the best
> approach, since this ultimately is what the code demanded. (I guess
> my memory has failed me!) We also did not have any examples of the
> many to one mapping. As you state, this has now changed and the
> latter must be correct. This now explains the FreeType implementation.

Conclusion:
I think we should change the following line in ft2font.cpp from:
    charmap[Py::Int((int) index)] = Py::Long((long) code);
to:
    charmap[Py::Long((long) code)] = Py::Int((int) index);
as proposed by Evgeniy.

This will simplify the few lines of code using it in .py files.

and second reply:

If my memory serves me correctly - or if the implementation has
changed over the past few years - get_charmap() is a wrapper on the
FreeType method. FreeType had no reverse mapping and creating one may
have caused problems later.

I prefer the second alternative. If FreeType now has a reverse
mapping, then by all means create a wrapper for it. If not, then you
will need to take some care that get_rcharmap is reasonably future
proof, so that it does not cause maintenance problems later on.

(...)

I think I did it. At the time the reverse mapping seemed the best
approach, since this ultimately is what the code demanded. (I guess
my memory has failed me!) We also did not have any examples of the
many to one mapping. As you state, this has now changed and the
latter must be correct. This now explains the FreeType implementation.

-- Paul

I used the wayback machine to search the FreeType docs. See:
http://web.archive.org/web/19990302062419/www.freetype.org/docs/user.txt

In 1999 (FreeType 1) you had to use the same approach as today -
convert character code to glyph index, not vice versa. From the above
file:

···

On 2/14/07, Paul Barrett <pebarrett@...149...> wrote:

(...)
g. Load the glyph:

    The glyph loader is easily queried through TT_Load_Glyph().
    This API function takes several arguments:

    o An instance handle to specify at which point size and
      resolution the loaded glyph should be scaled and grid-fitted.

    o A glyph container, used to hold the glyph's data in memory.
      Note that the instance and the glyph must relate to the _same_
      font file. An error would be produced immediately otherwise.

    o A glyph index, used to reference the glyph within the font
      file. This index is not a platform specific character code,
      and a character's glyph index may vary from one font to
      another. To compute glyph indexes from character codes, use
      the TT_CharMap handle created in section (f.) with
      TT_Char_Index().

      We strongly recommend using the Unicode charmap whenever
      possible.
(...)

From the FAQ (same year):

===
25. Does FreeType support "foreign languages"?

  Short Answer: YES, it does!

  From a TrueType font file point of view, there are several parts
  to the file, one of them being the 'glyphs', i.e. picture
  representation of the symbols.

  Another part is the mapping table, also called "charMap".

  For example, glyph #1 could be letter "A", and glyph #2 could be
  letter "Z". Glyphs can be stored in any order in a font file.

  The mapping tables contains at least one char-map entry. For
  example, you could have an ASCII-map that maps 0x41 to glyph #1,
  and 0x5A to glyph #2, etc. FreeType provides a "charMap" object
  class to access and use this information easily.

  There are several character encodings recognized and defined by
  the TrueType specification, like Latin-1, Unicode, Apple Scripts,
  WGL, etc., but a font file might only contain one or two of them.

  When using a more 'exotic' character encoding, like EBCDIC (this
  is IBM mainframe stuff!), you would need to translate it to one of
  the available formats (or to add a charmap table to the font).
  Cf. section 8.

From the above it's clear that FreeType *never* explicitly supported
the glyph->char mapping, but exactly the opposite.

In conclusion, I agree with Nicolas' proposition to change get_charmap
to do what it *should* do, map chars to glyph indexes.

If others agree, I could try to make the changes to SVN this weekend.

Best,
Edin

That's fine with me.

-- Paul

Thanks Paul, Edin and Evgeniy.

-- Nicolas

···

On 2/17/07, Paul Barrett <pebarrett@...149...> wrote:

That's fine with me.

On 2/16/07, Edin Salkovic <edin.salkovic@...149...> wrote:
> From the above it's clear that FreeType *never* explicitly supported
> the glyph->char mapping, but exactly the opposite.
>
> In conclusion, I agree with Nicolas' proposition to change get_charmap
> to do what it *should* do, map chars to glyph indexes.
>
> If others agree, I could try to make the changes to SVN this weekend.