Problems with Unicode in mathtext

After some thorough research on the subject I decided to post my
conclusions/thoughts here. Beware, this is a long one.

Font problems

There are currently no good, complete, free, Unicode Open/TrueType math
fonts. We will have to wait for the STIX fonts. Their site says that the
beta version of the fonts will be available in September, so the next
SoC could probably cover that, if we're lucky ;).

I had a look at the following Open/TrueType Unicode fonts:
* CMU fonts. These fonts have practically no math symbols, so
they're not a solution.
* The fonts used by Open Office - Open Symbol (opens___.ttf), which
has a decent set of (Unicode) symbols. These fonts were made to play
well with Times, and could be used in mathtext, perhaps together with
the Nimbus Roman fonts.
* FreeFont. GPL fonts, available on any Linux box. They have an
extensive list of supported symbols. Probably the best free TrueType
fonts out there.

The best solution to the problem of good fonts would be to use the
currently available CM and AMS (and other) Type1 fonts, which are free
and come with every TeX distribution. These fonts are complete and
have pretty good Unicode support, which is illustrated by the following
code:

# Python 2 code (uses unichr and the print statement)
from matplotlib.ft2font import FT2Font
import unicodedata

# Path to a Type1 font
filename = r'c:\texmf\fonts\type1\bluesky\symbols\msam10.pfb'

f = FT2Font(filename)
# get_charmap() here yields (glyph index, Unicode code point) pairs
indexes = f.get_charmap()
for index, uni in indexes.items():
    try:
        name = unicodedata.name(unichr(uni))
    except ValueError:
        # The code point has no name in the Unicode database
        name = None
    print f.get_glyph_name(index), index, name, repr(unichr(uni))

which outputs

space 128 SPACE u' '
diamond 6 BLACK DIAMOND SUIT u'\u2666'
therefore 41 THEREFORE u'\u2234'
because 42 BECAUSE u'\u2235'
muchless 110 MUCH LESS-THAN u'\u226a'
muchgreater 111 MUCH GREATER-THAN u'\u226b'
dblarrowleft 18 LEFT RIGHT DOUBLE ARROW u'\u21d4'
dblarrowright 19 RIGHTWARDS DOUBLE ARROW u'\u21d2'
lessorgreater 55 LESS-THAN OR GREATER-THAN u'\u2276'
greaterorless 63 GREATER-THAN OR LESS-THAN u'\u2277'
angle 92 ANGLE u'\u2220'
proportional 95 PROPORTIONAL TO u'\u221d'

The msam10 font was used in the code above, but other fonts behave similarly.

Unfortunately, the most important method of the FT2Font class,

f.get_glyph(index)

raises

ValueError: Glyph index out of range

for Type1 fonts, but I think that this could be easily fixed.

Current C++ code for get_glyph:

char FT2Font::get_glyph__doc__[] =
"get_glyph(num)\n"
"\n"
"Return the glyph object with num num\n"
;
Py::Object
FT2Font::get_glyph(const Py::Tuple & args){
  _VERBOSE("FT2Font::get_glyph");

  args.verify_length(1);
  int num = Py::Int(args[0]);

  if ( (size_t)num >= gms.size())
    throw Py::ValueError("Glyph index out of range");

  //todo: refcount?
  return Py::asObject(gms[num]);
}

The problem with this solution (if we get get_glyph to work with
Type1) could be the backends. Agg wouldn't have to change much (if at
all), but I don't know about the PS and SVG backends. Type1 fonts can be
installed on both Windows (via .pfm files) and Unix systems, so I
guess SVG files could be viewed/changed without much hassle, and the
PS backend could be changed a bit to support Type1 fonts.

Also, the characters are spread across a pretty large number of
files, but I suppose that a little code can overcome this.

Unicode problems

The following is assembled from the report "Unicode Support for
Mathematics", which is the primary source of information regarding
mathematics and Unicode.

The biggest problem with *proper* math Unicode is the "Mathematical
Alphanumeric Symbols" block, which is found in the 1D400..1D7FF range,
outside the Basic Multilingual Plane (BMP). These characters are not found
in any free font. I also noticed that Python's support for Unicode outside
the BMP is not very good, probably because some builds store such
characters as two UTF-16 surrogates. The following example works on Linux
(Ubuntu 6.06), but doesn't work on Windows XP (32-bit):

import unicodedata
unicodedata.name(u'\U0001d400')

On Windows this raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: need a single Unicode character as parameter

The output should be:
MATHEMATICAL BOLD CAPITAL A
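
The difference between the two platforms can be checked directly. The sketch below assumes Python 3 (the session above is Python 2), where a character outside the BMP is always a single character. It also computes the UTF-16 surrogate pair that a build with 16-bit characters would store instead, which would explain the TypeError. The helper name surrogate_pair is made up for this illustration.

```python
import sys
import unicodedata


def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bold_a = "\U0001d400"
name = unicodedata.name(bold_a)
pair = surrogate_pair(ord(bold_a))

# sys.maxunicode is 0xFFFF on builds limited to 16-bit characters
print(sys.maxunicode > 0xFFFF)
print(name)
print([hex(unit) for unit in pair])
```

If the first printed value is False, the interpreter sees the character as two code units, and unicodedata.name() has no single character to look up.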

The "Mathematical Alphanumeric Symbols" block contains:
* Mathematical bold letters
* Mathematical italic letters (used for variables, default font in
TeX math mode)
* Mathematical bold italic letters
* Mathematical script (calligraphic) letters
* Mathematical bold script letters
* Mathematical fraktur letters
* Mathematical double-struck letters
* Mathematical bold fraktur letters
* Mathematical sans-serif letters
* Mathematical sans-serif bold letters
* Mathematical sans-serif italic letters
* Mathematical sans-serif bold italic letters
* Mathematical monospace letters
* Dotless symbols
* Bold Greek symbols
* Additional bold Greek symbols
* Italic Greek symbols
* Additional italic Greek symbols
* Bold italic Greek symbols
* Additional bold italic Greek symbols
* Sans-serif bold Greek symbols
* Sans-serif bold italic Greek symbols
* Additional sans-serif bold Greek symbols
* Additional sans-serif bold italic Greek symbols
* Bold digits
* Double-struck digits
* Sans-serif digits
* Sans-serif bold digits
* Monospace digits

These were all put in the Unicode character set because of their
semantic meanings in mathematics, although practically all are just
font variations (<font>). The roman math letters (serif, normal, used
for digits) default to the "Basic Latin" block.
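
The <font> in parentheses above appears to match the compatibility decomposition tag that the Unicode character database attaches to these characters. A minimal Python 3 sketch, using only the standard unicodedata module, shows the tag and the mapping back to Basic Latin. The helper to_math_bold is hypothetical and covers only the bold letters, which form two contiguous runs starting at U+1D400 and U+1D41A.

```python
import unicodedata

bold_a = chr(0x1D400)  # MATHEMATICAL BOLD CAPITAL A

# Compatibility decomposition: the <font> tag followed by the base code point
decomposition = unicodedata.decomposition(bold_a)

# NFKD normalization drops the font variation and yields the plain letter
plain = unicodedata.normalize("NFKD", bold_a)


def to_math_bold(text):
    """Map ASCII letters to Mathematical Bold; leave everything else alone."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


print(decomposition, plain)
print(unicodedata.name(to_math_bold("x")))
```

Other styles cannot always be handled with a plain offset, because some letters (for example the italic small h) were already encoded elsewhere and leave holes in the block.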

It is interesting to note that the "Mathematical Alphanumeric Symbols"
block doesn't seem to be supported by, for example, Arial Unicode MS
(it supports only the BMP).

This issue cannot be properly solved until the STIX fonts come
out. If they package them right (and they ought to), we could have a
single .ttf file for all the glyphs needed for mathtext. Until then,
any solution will need some sort of mapping between Unicode blocks
(character ranges) and font files (at least for italic, calligraphic
etc. fonts).
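
Such a mapping could be as simple as a sorted table of character ranges. The following is only a sketch in Python 3: the font file names are placeholders invented for the example, and the function name font_for is hypothetical. Only the code point ranges are taken from the Unicode block layout.

```python
from bisect import bisect_right

# (first code point, last code point, font file), sorted by first code point.
# The file names are placeholders, not real font files.
RANGES = [
    (0x0020, 0x007E, "roman.ttf"),     # Basic Latin
    (0x2200, 0x22FF, "symbols.ttf"),   # Mathematical Operators
    (0x1D400, 0x1D433, "bold.ttf"),    # Mathematical bold letters
    (0x1D434, 0x1D467, "italic.ttf"),  # Mathematical italic letters
    (0x1D49C, 0x1D4CF, "script.ttf"),  # Mathematical script letters
]
STARTS = [first for first, _, _ in RANGES]


def font_for(code_point):
    """Return the font file covering code_point, or None if no range matches."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0:
        first, last, font = RANGES[i]
        if code_point <= last:
            return font
    return None


print(font_for(0x2234))   # THEREFORE falls in the Mathematical Operators range
print(font_for(0x1D400))  # MATHEMATICAL BOLD CAPITAL A
```

A binary search keeps the lookup cheap even if the table grows to cover every style in the block.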

Possible enhancements

I think there should be a thin Python wrapper around the FreeType2
FT2Font class. Then, for example, all the caching could be handled by
that wrapper. This would allow caching not only for mathtext, but even
for *plain text*, and would clean up the code. It would also allow adding
new functionality without messing around with C++ and without
breaking old code.
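
To make the caching idea concrete, here is a rough Python 3 sketch of such a wrapper. It assumes only that the wrapped object has a load_char(char_code) method, as FT2Font does. The names CachedFont and FakeFace are invented for the example, and FakeFace merely stands in for a real font so that the sketch runs without FreeType.

```python
class CachedFont:
    """Thin wrapper that caches glyphs loaded from an underlying font object."""

    def __init__(self, face):
        self._face = face   # any object exposing load_char(char_code)
        self._cache = {}    # char code -> glyph object

    def glyph(self, char):
        """Return the glyph for a one-character string, loading it only once."""
        code = ord(char)
        if code not in self._cache:
            self._cache[code] = self._face.load_char(code)
        return self._cache[code]


class FakeFace:
    """Hypothetical stand-in for a font; counts how often a glyph is loaded."""

    def __init__(self):
        self.loads = 0

    def load_char(self, code):
        self.loads += 1
        return ("glyph", code)


face = FakeFace()
font = CachedFont(face)
first = font.glyph("a")
second = font.glyph("a")  # served from the cache, no second load
print(face.loads)
```

Because the cache lives in Python, both mathtext and plain text rendering could share it without touching the C++ side.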

One could then, for example, have an FT2Font method
get_unicode_glyph that would return the glyph for a given Unicode
code point, or better yet, the following code would be easy to implement:
glyphs = FT2Font('/path/to/font')
glypha = glyphs['a']

or even:
text_to_render = glyphs.text('Some lame text')

or something similar. Again, this would not break old code and would
ease writing new code. However, as John once said:

The font library is probably an SOC project of its own, because we
would like to settle on one freetype library that both matplotlib and
enthought/chaco can use. How to deal with this issue without becoming
consumed by it will require some thought.

Conclusion

John, what should I do? Please comment.

I think that the best solution right now is, unfortunately, the BaKoMa
fonts. If we could get the Type1 fonts to work, then I could probably
integrate them into the existing model easily. I could also try to do
something with the Open Symbol fonts and the FreeFont (Windows users
could download them separately).

Cheers,
Edin

The STIX fonts have been delayed a number of times, but they just received the
last set of glyphs. This is good news.

On Wednesday 12 July 2006 07:10, Edin Salković wrote:

After some thorough research on the subject I decided to post my
conclusions/thoughts here. Beware, this is a long one.

Font problems

There are no good, complete, free, unicode, Open/TrueType math fonts
currently. We will have to wait for the STIX fonts.