text bbox, postscript formatting problem

John_Hunter1 · May 6, 2005, 3:44am

    > Let me make sure I understand this. If we map mathtext
    > characters to unicode, and use freefont for now, will that
    > help prepare MPL for STIX fonts? If there is an option
    > available now that moves MPL in the direction of a
    > permanent solution, then it seems like the decision is
    > already made.

What follows is a long post on getting unicode fonts to work with
mathtext, which is a very important goal. But there is another goal
which is also important and that may serve your thesis needs well: the
ability to farm out text handling to TeX/LaTeX, either for ps or for
png using dvipng.

Now, on to the unicode question.

In principle we should be able to substitute any set of unicode fonts
with any other, since they will all use the same encoding. Last time
I looked into replacing the bakoma fonts I spent a while looking at
the umbelleek fonts, but I came to the conclusion that they do not use
a unicode encoding, despite their author's later advocacy of unicode:

http://www.tug.org/TUGboat/Articles/tb19-3/tb60kinch.pdf

So I think freefont is a better path to pursue (I wasn't aware of
these until reading Baptiste's post); even though they are GPLd, they
will ease the path to integration with other unicode fontsets later.

    > Can we come up with some kind of a plan or
    > design document for what steps we need to take? I will
    > pick at it after work, if I understand what needs to be
    > done.

I am happy to help, offer advice and pointers and so on, but there is
no definite set of steps I can lay out. The person who has their
boots on wading through the mud will have to make many of these
decisions. There is no 1-to-1 mapping between TeX symbols and
unicode. Most unicode symbols (e.g. ancient Cypriot) have no TeX
equivalent, and many TeX symbols have no unicode equivalent (e.g. there
is no unicode symbol for each of \sqrt, \Sqrt, \SQRT).

So some creative ways to handle these cases will have to be devised;
a good start would be to google search

tex unicode

and do a little reading to get the lay of the land. There have been
previous efforts at mapping characters between TeX and unicode, and
I've worked on this before (see below). Also, search the archives for
any posts by Robert Kern on the issue of mathtext --- they are all
filled with sage advice and wonderful links that you will never find
even if you google for 1000 years. Unfortunately, the sourceforge
search engine is as sucky as their stats engine, so finding these
posts may be difficult.

> Now that the new formatter is complete, I have to find new
> ways to procrastinate. I will defend in August.

Hmm, in my experience, having nothing to do is only the 2nd best
motivator for working on an open source project. The best one, of
course, is having a dissertation you should be working on. I'll try
and keep up with you.

Included below is a hodge-podge of some stuff I dredged out of my
examples and test directories related to fonts, mathtext and unicode
-- collectively they provide the tools required to put all these
pieces together.

The following is a script to parse a unicode -> TeX mapping found at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/ -- grab the file TeX.txt
and run this script on it. The code parses that file and builds a
dictionary from TeX->unicode:

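items = []

for line in file('TeX.txt'):
    line = line.strip()
    # skip lines that contain no TeX command at all
    if line.find('\\') < 0: continue
    vals = line.split('\t')
    for val in vals:
        # each field is expected to look like "<character> <TeX name>"
        tup = val.split(' ')
        if len(tup)!=2: continue
        code, sym = tup
        if not sym.startswith('\\'): continue
        items.append((sym, code))

for k,v in items:
    # v is the UTF-8 encoded character; convert it to its code point
    o = ord(v.decode('utf-8'))

    #print k,v,o, hex(o)
    print "    r'%s' : %d," % (k, o)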

and generates output like

peds-pc311:~/python/mplsupport/test> python parse_tex.py
    r'\alpha' : 945,
    r'\iota' : 953,
    r'\varrho' : 1009,
    r'\beta' : 946,
    r'\kappa' : 954,
    r'\sigma' : 963,

which you can use to create a dictionary mapping tex syms to unicode
indices. You can save this dictionary as a _mathtext_data dict, for
use by the mathtext module.
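
For anyone repeating this step today, here is a minimal Python 3 sketch of the same parse that returns the dictionary directly instead of printing it. The sample text is made up for illustration and only imitates the tab-separated "<character> <TeX name>" fields that the script above expects; the name parse_tex_table is hypothetical and not part of matplotlib.

```python
# Python 3 sketch of the TeX -> Unicode parse. SAMPLE is invented data that
# only imitates the "<char> <TeX name>" field layout assumed by the script.
SAMPLE = "α \\alpha\tι \\iota\nβ \\beta\tκ \\kappa\nnot a mapping line\n"


def parse_tex_table(text):
    """Return a dict mapping TeX command names to Unicode code points."""
    mapping = {}
    for line in text.splitlines():
        for field in line.strip().split("\t"):
            parts = field.split(" ")
            if len(parts) != 2:
                continue
            char, name = parts
            # Keep only single characters paired with a backslash command.
            if len(char) != 1 or not name.startswith("\\"):
                continue
            mapping[name] = ord(char)
    return mapping


tex2uni = parse_tex_table(SAMPLE)
for name in sorted(tex2uni):
    print("    r'%s' : %d," % (name, tex2uni[name]))
```

TeX symbols with no unicode counterpart simply never appear in such a dictionary, so whatever consumes it still needs a separate strategy for them.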

The next task is to take a set of fonts and build a mapping from
unicode index to fontname, glyph index. This will require some
mastery of ft2font. Last time I was working on this I wrote
examples/font_indexing.py, mainly as a reminder to myself, on how to
use the module to extract the relevant information from font files,
character names, glyph indexes and character codes. I now wish I had
added more comments <wink>. You may want to try this example, read
over it, and make sure you understand what it is doing (add comments
as you learn and commit the updates to CVS).

Many fonts have multiple character maps. Normally the 0 charmap is
unicode if there is a unicode char map. Let's look at the freefont
files and see how we can use the ft2font to find the font with \alpha
(should be at character code 945 from the results above). Below is
some code I wrote to iterate over a list of ttf files and print the
character codes, glyph indices and character names contained in those
files. I'm running this over all the fonts in the freefont dirs and
grepping for 945 and alpha to eliminate the noise

> python find_unicode_texsyms.py /usr/share/fonts/truetype/freefont/*.ttf|grep 945|grep alpha

produces the following output

  FreeMonoBoldOblique.ttf 0 447 945 alpha
  FreeMonoBoldOblique.ttf 2 447 945 alpha
  FreeMonoBold.ttf 0 612 945 alpha
  FreeMonoBold.ttf 2 612 945 alpha
  FreeMonoOblique.ttf 0 651 945 alpha
  FreeMonoOblique.ttf 2 651 945 alpha
  FreeMono.ttf 0 679 945 alpha
  FreeMono.ttf 2 679 945 alpha
  FreeSansBoldOblique.ttf 0 394 945 alpha
  FreeSansBoldOblique.ttf 2 394 945 alpha
  FreeSansBold.ttf 0 438 945 alpha
  FreeSansBold.ttf 2 438 945 alpha
  FreeSansOblique.ttf 0 457 945 alpha
  FreeSansOblique.ttf 2 457 945 alpha
  FreeSans.ttf 0 570 945 alpha
  FreeSans.ttf 2 570 945 alpha
  FreeSerifBoldItalic.ttf 0 546 945 alpha
  FreeSerifBoldItalic.ttf 2 546 945 alpha
  FreeSerifBold.ttf 0 530 945 alpha
  FreeSerifBold.ttf 2 530 945 alpha
  FreeSerifItalic.ttf 0 527 945 alpha
  FreeSerifItalic.ttf 2 527 945 alpha
  FreeSerif.ttf 0 566 945 alpha
  FreeSerif.ttf 2 566 945 alpha
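
One way to turn a listing like this into the "unicode index to fontname, glyph index" mapping is sketched below in Python 3. It works on the printed text rather than on ft2font itself, and the function name build_index is hypothetical; the sample lines are copied from the output above.

```python
from collections import defaultdict

# A few lines copied from the listing above: file, charmap number,
# glyph index, character code, glyph name.
LISTING = """\
FreeMono.ttf 0 679 945 alpha
FreeMono.ttf 2 679 945 alpha
FreeSans.ttf 0 570 945 alpha
FreeSans.ttf 2 570 945 alpha
FreeSerif.ttf 0 566 945 alpha
FreeSerif.ttf 2 566 945 alpha
"""


def build_index(listing):
    """Map character code -> {font file: glyph index}.

    Entries from different charmaps of the same file are merged; if they
    ever disagreed, the later line would silently win.
    """
    index = defaultdict(dict)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # e.g. the "loaded ..." status lines
        fname, _charmap, gind, code, _name = fields
        index[int(code)][fname] = int(gind)
    return index


index = build_index(LISTING)
print(index[945])
```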

As mentioned, selecting charmap 0 is supposed to select a unicode
character map, and apparently charmap 2 is also such a map. So you have
\alpha in a bunch of different styles (plain, bold, italic, etc -- how
to deal with all of this choice in the context of TeX/mathtext fonts
like rm, it, tt etc is where some of the artistry referred to above
comes in).
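
To make the style question concrete, here is a small Python 3 sketch of one possible lookup. The assignment of rm, it and tt to particular freefont files is purely an assumption for illustration, not a decision anyone has made; the glyph indices are the ones for character code 945 in the listing above, and the names STYLE_TO_FONT, GLYPHS and lookup are hypothetical.

```python
# Hypothetical assignment of mathtext styles to freefont files; the choice
# of family for each style is an illustration, not a settled decision.
STYLE_TO_FONT = {
    "rm": "FreeSerif.ttf",
    "it": "FreeSerifItalic.ttf",
    "tt": "FreeMono.ttf",
}

# Glyph indices for character code 945 (\alpha), taken from the listing above.
GLYPHS = {
    "FreeSerif.ttf": {945: 566},
    "FreeSerifItalic.ttf": {945: 527},
    "FreeMono.ttf": {945: 679},
}


def lookup(style, code, fallback="rm"):
    """Return (font file, glyph index), trying the fallback style if needed."""
    for candidate in (style, fallback):
        fname = STYLE_TO_FONT.get(candidate)
        gind = GLYPHS.get(fname, {}).get(code)
        if gind is not None:
            return fname, gind
    raise KeyError("no glyph for code %d in style %r" % (code, style))


print(lookup("it", 945))
print(lookup("bf", 945))  # unknown style, falls back to "rm"
```

Falling back to a single default style is only one policy; a real implementation might prefer to search every font that carries the glyph.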

Below is the code that generated this output -- hopefully it will give
you some more insight into how to use ft2font [BTW, if you take this
on, it would be really helpful if right now you open a notes file and
start a tutorial to self about what you are learning. I have to
relearn this stuff myself every time I work on it (and I wrote most of
the font code and the examples). There is no better time to write
helpful documentation than when learning. Someone may have to do this
again one day, and that someone may be you!]

import sys, os
from glob import glob
from matplotlib.font_manager import fontManager
from matplotlib.ft2font import FT2Font
from matplotlib.cbook import reverse_dict

for fname in sys.argv[1:]:

    #for fname in fontManager.ttffiles:
    font = FT2Font(fname)
    print 'loaded', fname, font.num_charmaps
    for i in range(font.num_charmaps):
        font.set_charmap(i)
        cmap = font.get_charmap()
        items = cmap.items()
        items.sort()
        fname = os.path.split(fname)[-1]
        for gind, code in items:
            name = font.get_glyph_name(gind)
            print fname, i, gind, code, name

OK, so now we have some mappings from TeX -> unicode and some idea of
how to map unicode symbols to font names and glyph indices. Another
tool which you can look at to understand font handling and glyph
rendering is in the mpl examples dir. The following builds a standard
font table in a plot window

> ./font_table_ttf.py /usr/share/fonts/truetype/freefont/FreeSans.ttf

This should be enough for tonight :-). We can talk by phone tomorrow
if you think it would help, or you can post questions here. It's good
to get some of this on record. I've spent many hours working on this
problem, but have never had the time and stamina to see it through.
mathtext in matplotlib has a lot of promise but the current
implementation is not satisfactory. Getting a good set of unicode
fonts working would be a significant step forward.

Thanks!
JDH

Robert_Kern · May 6, 2005, 4:09am

John Hunter wrote:

> So some creative ways to handle these cases will have to be devised;
> a good start would be to google search
>
> tex unicode

Also

tex mathml

since the MathML people have to deal with the same problem.

> and do a little reading to get the lay of the land. There have been
> previous efforts at mapping characters between TeX and unicode, and
> I've worked on this before (see below). Also, search the archives for
> any posts by Robert Kern on the issue of mathtext --- they are all
> filled with sage advice and wonderful links that you will never find
> even if you google for 1000 years. Unfortunately, the sourceforge
> search engine is as sucky as their stats engine, so finding these
> posts may be difficult.

But Gmane is pretty good.

http://search.gmane.org/search.php?query=tex&email=kern&group=gmane.comp.python.matplotlib.general&sort=date

And a couple of useful tidbits that I sent privately:

"""There's some information here to go from various TeX font encodings to Unicode:

They claim some parts of it are GPLed from the catdvi project. I'm not sure if one can copyright this kind of information; how much creative choice is involved? But it might be worth asking the relevant individuals for permission.
"""

"""If you want to use the algorithms from _TeX: The Program_, I would suggest that you take a look at the publications of Luca Padovani, the implementor of GtkMathView. He describes how he uses, more or less, the TeX algorithms without the extra information provided by fonts designed for use in TeX.

MathML formatting with TeX rules, TeX fonts, and TeX quality

PhD thesis: MathML Formatting
http://www.cs.unibo.it/~lpadovan/phd/main.pdf
"""

--
Robert Kern
rkern@...170...

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Paul_Barrett · May 6, 2005, 4:22am

John Hunter wrote:

> As mentioned, selecting charmap 0 is supposed to select a unicode
> character map, and apparently charmap 2 is such a map. So you have
> \alpha in a bunch of different styles (plain, bold, italic, etc -- how
> to deal with all of this choice in the context of TeX/mathtext fonts
> like rm, it, tt etc is where some of the artistry referred to above
> comes in).

I would think the font_manager should be able to help you here, at least for the 'rm' and 'it' styles. The font_manager tries to provide such information with the .style attribute. The 'tt' style is the big problem, since this requires a fixed width font. TTF fonts don't provide this information. It must be known beforehand or somehow deduced by reading the widths of the characters, if one is to do this in the general sense. The other option is to hardcode it into MPL. You probably know this already, but I thought I should mention it anyway.
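
The "deduce it from the widths" option could look roughly like the Python 3 sketch below. It assumes the glyph advance widths have already been read from the font by some other means; the function name is_fixed_width and the sample widths are made up for illustration.

```python
def is_fixed_width(advances, tolerance=0):
    """Guess whether a font is monospaced from its glyph advance widths.

    `advances` maps glyph names to horizontal advances in font units.
    Zero-width glyphs (e.g. combining marks) are ignored.
    """
    widths = [w for w in advances.values() if w > 0]
    if not widths:
        return False
    return max(widths) - min(widths) <= tolerance


# Made-up advance widths for illustration only.
mono = {"i": 600, "m": 600, "W": 600, "combining_acute": 0}
prop = {"i": 278, "m": 833, "W": 944}
print(is_fixed_width(mono), is_fixed_width(prop))
```

Checking only a handful of glyphs such as "i", "m" and "W" would be cheaper than scanning the whole font, at the cost of being fooled by fonts that are fixed width only for part of their range.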

-- Paul

--
Paul Barrett, PhD Space Telescope Science Institute
Phone: 410-338-4475 ESS/Science Software Branch
FAX: 410-338-4767 Baltimore, MD 21218

_Darren_Dale · May 8, 2005, 10:05pm

[...]

> The next task is to take a set of fonts and build a mapping from
> unicode index to fontname, glyph index. This will require some
> mastery of ft2font. Last time I was working on this I wrote
> examples/font_indexing.py, mainly as a reminder to myself, on how to
> use the module to extract the relevant information from font files,
> character names, glyph indexes and character codes. I now wish I had
> added more comments <wink>. You may want to try this example, read
> over it, and make sure you understand what it is doing (add comments
> as you learn and commit the updates to CVS).

Traceback (most recent call last):
  File "font_indexing.py", line 36, in ?
    print 'AV', font.get_kerning(glyphd['A'], glyphd['V'])/64.0
IndexError: Unexpected SeqBase<T> length.

I had to make a change to font_indexing.py, the call to get_kerning() needed a
kerning mode parameter. This will output information that can be compared
with values reported by fontforge:

print 'AV', font.get_kerning(glyphd['A'], glyphd['V'], KERNING_UNSCALED)
print 'AA', font.get_kerning(glyphd['A'], glyphd['A'], KERNING_UNSCALED)

The results are:
Vera: (same in fontforge)
AV -131
AA 57
VeraSe: (same in fontforge)
AV -102
AA 0
FreeSerif:
AV 0.0 (-131 according to fontforge)
AA 0.0
FreeSans:
AV 0.0 (-75 according to fontforge)
AA 0.0

PostScript output using the freefont afm's also appears not to be kerned,
which I don't understand.
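
A note on units when comparing these numbers: as far as I understand FreeType, its default kerning mode returns 26.6 fixed-point pixel values (hence the division by 64.0 in the original script), while the unscaled mode returns font design units, which is what fontforge shows. The Python 3 sketch below converts between them; the 2048 units per em and the 12 pt size are assumptions for the example, so check the actual value in the font before relying on the result.

```python
def fixed_26_6_to_pixels(value):
    """Convert a FreeType 26.6 fixed-point value to pixels."""
    return value / 64.0


def font_units_to_points(value, units_per_em, size_pt):
    """Scale a value in font design units to points at a given font size."""
    return value * size_pt / float(units_per_em)


# -131 is the unscaled AV kerning reported for Vera above; 2048 units per em
# and a 12 pt size are assumptions made for this example.
print(font_units_to_points(-131, 2048, 12))
print(fixed_26_6_to_pixels(-128))
```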

Darren

On Thursday 05 May 2005 11:44 pm, John Hunter wrote: