Questions about mathtext, unicode conversion etc.

John_Hunter1 · June 15, 2006, 9:06pm

    > Hi all, Is it that the code in the mathtext module looks
    > ugly or is it just me not understanding it? Also, if anyone
    > has some good online sources about parsing etc. on the net,
    > I would really appreciate it.

It's probably you not understanding it. In my opinion, the code is
pretty nice and modular, with a few exceptions, but I'm biased.
Parsers can be a little hard to understand at first. You might start
by trying to understand pyparsing

http://pyparsing.wikispaces.com

and work through some of the basic examples there. Once you have your
head wrapped around that, it will get easier.

> Considering the following code (picked at random, from
> mathtext.py)

> I don't understand, for example, what does the statement:

> expression.parseString( s )

> do?

    > "expression" is defined globally, and is called (that is -
    > its method) only once in the above definition of the
    > function, but I don't understand - what does that particular
    > line do?!?

It's not defined globally, but at module level. There is only one
expression that represents a TeX math expression (at least as far as
mathtext is concerned) so it is right that there is only one of them
at module level. It's like saying "a name is a first name followed by
an optional middle initial followed by a last name". You only need to
define this one, and then you set handlers to handle the different
components.
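
As a rough illustration of that name analogy (a sketch, not code from mathtext.py): the grammar is defined once, handlers are attached with setParseAction, and parseString runs the grammar over the input and calls each handler as its pattern matches. The handler names on_initial and on_name and the helper parse_name are made up for this example. Newer pyparsing releases also offer snake_case spellings of these methods.

```python
from pyparsing import Optional, Suppress, Word, alphas

# Hypothetical handlers; pyparsing passes (original string, location, tokens).
def on_initial(s, loc, toks):
    # Fires only when a middle initial is actually present.
    return [toks[0] + "."]

def on_name(s, loc, toks):
    # Fires once the whole name has matched; toks holds the parts found.
    return [" ".join(toks)]

first_name = Word(alphas)
middle_initial = (Word(alphas, exact=1) + Suppress(".")).setParseAction(on_initial)
last_name = Word(alphas)

# Defined once at module level, like "expression" in mathtext.py.
name = (first_name + Optional(middle_initial) + last_name).setParseAction(on_name)

def parse_name(text):
    # parseString walks the text, matches the grammar, and returns the
    # tokens as rewritten by the handlers.
    return list(name.parseString(text))
```

Calling parse_name("John D. Hunter") triggers both handlers, while parse_name("John Hunter") skips on_initial because the optional part never matches.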

The expression assigns subexpressions to handlers. The statement
below says that an expression is one or more of a space, font element,
an accent, a symbol, a subscript, etc...

    expression = OneOrMore(
        space ^ font ^ accent ^ symbol ^ subscript ^ superscript
        ^ subsuperscript ^ group ^ composite
    ).setParseAction(handler.expression).setName("expression")

A subscript, for example, is an optional symbol group followed by an
underscore followed by a symbol group

    subscript << Group( Optional(symgroup) + Literal('_') + symgroup )

and the handler is attached where subscript is declared as a Forward,
earlier in the module

    subscript = Forward().setParseAction(handler.subscript).setName("subscript")

which means that the function handler.subscript will be called every
time the pattern is matched. The tokens will be the first symbol
group, the underscore, and the second symbol group. Here is the
implementation of that function

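    def subscript(self, s, loc, toks):
        # Group(...) wraps the match, so toks holds exactly one group.
        assert(len(toks)==1)
        #print 'subsup', toks
        if len(toks[0])==2:
            # No leading symbol group: attach to an empty space element.
            under, next = toks[0]
            prev = SpaceElement(0)
        else:
            prev, under, next = toks[0]

        if self.is_overunder(prev):
            # Over/under symbols take the subscript directly below.
            prev.neighbors['below'] = next
        else:
            prev.neighbors['subscript'] = next

        return loc, [prev]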

This grabs the tokens and assigns them to the names "prev" and "next".
Every element in the TeX expression is a special case of an Element,
and every Element has a dictionary mapping surrounding elements to
relative locations, either above or below or right or superscript or
subscript. The rest of this function takes the "next" element, and
assigns it either below (eg for \sum_0) or subscript (eg for x_0) and
the layout engine will then take this big tree and lay it out. See
for example the "set_origin" function.
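
To make the neighbors idea concrete, here is a stripped-down sketch. The Element class, attach_subscript and walk are hypothetical stand-ins written for this example, not the classes in mathtext.py; only the 'below' and 'subscript' keys mirror handler.subscript.

```python
class Element:
    """Hypothetical stand-in for a mathtext element (not the real class)."""

    def __init__(self, label, overunder=False):
        self.label = label
        self.overunder = overunder  # True for operators like \sum
        self.neighbors = {}         # relative location -> Element


def attach_subscript(prev, sub):
    # Same decision as handler.subscript: below for over/under operators,
    # ordinary subscript position for everything else.
    key = "below" if prev.overunder else "subscript"
    prev.neighbors[key] = sub
    return prev


def walk(elem, where="root", depth=0):
    # Yield (depth, relative location, label) for the element and then,
    # recursively, for each of its neighbors.
    yield depth, where, elem.label
    for loc, child in elem.neighbors.items():
        yield from walk(child, loc, depth + 1)


x = attach_subscript(Element("x"), Element("0"))
total = attach_subscript(Element(r"\sum", overunder=True), Element("0"))
```

A layout pass can then traverse the tree with something like walk(x) and position each neighbor relative to its parent.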

    > Regarding the unicode support in mathtext, mathtext
    > currently uses the following dictionary for getting the glyph
    > info out of the font files:

    > latex_to_bakoma = {
    >     r'\oint'      : ('cmex10', 45),
    >     r'\bigodot'   : ('cmex10', 50),
    >     r'\bigoplus'  : ('cmex10', 55),
    >     r'\bigotimes' : ('cmex10', 59),
    >     r'\sum'       : ('cmex10', 51),
    >     r'\prod'      : ('cmex10', 24),
    >     ...
    > }

    > I managed to build the following dictionary (a little more left
    > to be done):
    >
    > tex_to_unicode = {
    >     r'\S'      : u'\u00a7',
    >     r'\P'      : u'\u00b6',
    >     r'\Gamma'  : u'\u0393',
    >     r'\Delta'  : u'\u0394',
    >     r'\Theta'  : u'\u0398',
    >     r'\Lambda' : u'\u039b',
    >     r'\Xi'     : u'\u039e',
    >     r'\Pi'     : u'\u03a0',
    >     r'\Sigma'  : u'\u03a3',
> unicode_to_tex is straightforward. Am I on the right
> track? What should I do next?

Yes, this looks like the right approach. Once you have this
dictionary mostly working, you will need to try and make it work with
a set of unicode fonts. So instead of having the tex symbol point to
a file name and glyph index, you will need to parse a set of unicode
fonts to see which unicode symbols they provide and build a mapping
from unicode name -> file, glyph index. Then when you encounter a tex
symbol, you can use your tex_to_unicode dict combined with your
unicode -> filename, glyphindex dict to get the desired glyph.
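
A minimal sketch of that two-step lookup, assuming the font scan has already produced a second dict. The font file name, the glyph indices, and the names unicode_to_glyph and lookup_glyph are invented for illustration; only the tex_to_unicode entries come from the dictionary quoted above.

```python
# Entries copied from the tex_to_unicode dict quoted above.
tex_to_unicode = {
    r'\Gamma': '\u0393',
    r'\Delta': '\u0394',
    r'\Sigma': '\u03a3',
}

# Hypothetical result of scanning a unicode font: the file name and the
# glyph indices are made up for this example.
unicode_to_glyph = {
    '\u0393': ('example_unicode_font.ttf', 101),
    '\u0394': ('example_unicode_font.ttf', 102),
}

# The reverse mapping is a plain inversion of the first dict.
unicode_to_tex = {uni: tex for tex, uni in tex_to_unicode.items()}


def lookup_glyph(tex_symbol):
    """Return (font file, glyph index) for a TeX symbol, or None."""
    uni = tex_to_unicode.get(tex_symbol)
    if uni is None:
        return None                   # unknown TeX symbol
    return unicode_to_glyph.get(uni)  # None if no scanned font provides it
```

Returning None in both failure cases keeps the sketch short; a real implementation would probably want to distinguish an unknown TeX symbol from a symbol that no available font covers.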

    > I also noticed that some TeX commands (commands in the sense
    > that they can have arguments enclosed in brackets {}) are
    > defined as only symbols: \sqrt alone, for example, displays
    > just the beginning of the square root: √, and \sqrt{123}
    > triggers an error.

We don't have support for \sqrt{123} because we would need to do
something a little fancier (draw the horizontal line over 123). This
is doable and would be nice. To implement it, one approach would be
to add some basic drawing functionality to the freetype module, eg to
tell freetype to draw a line on its bitmap. Another approach would
simply be to grab the bitmap from freetype and pass it off to agg and
use the agg renderer to decorate it. This is probably preferable.

But I think this is a lower priority right now.
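
Whichever library ends up doing the drawing, the decoration step itself is small. Here is a toy sketch on a plain list-of-rows bitmap. This is not the freetype or agg API; draw_overline and the bitmap layout are assumptions made up for illustration.

```python
def draw_overline(bitmap, x0, x1, row=0, ink=255):
    """Return a copy of bitmap with a horizontal line from x0 to x1 (inclusive)."""
    out = [list(r) for r in bitmap]  # copy so the original stays untouched
    for x in range(x0, x1 + 1):
        out[row][x] = ink
    return out


# A blank 4x8 grayscale bitmap standing in for the rendered "123".
glyphs = [[0] * 8 for _ in range(4)]

# Draw the bar of the square root across columns 2..7 of the top row.
decorated = draw_overline(glyphs, 2, 7)
```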

JDH