Unicode to TeX symbols, Type1 names, and vice versa

    > I finally solved the problem of automatically generating the
    > dicts for unicode <-> TeX conversion. This is the first step
    > in enabling unicode support in mathtext.

Excellent.

    > The STIX project is useful after all :wink: They keep a nice
    > table of Unicode symbols at:
    > http://www.ams.org/STIX/bnb/stix-tbl.ascii-2005-09-24

    > Any comments about the script are appreciated :). Now I'll

Since you asked :slight_smile:

I may not have mentioned this but the style conventions for mpl code
are

  functions : lower or lower_score_separated
  variables and attributes : lower or lowerUpper
  classes : Upper or MixedUpper
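
A minimal illustration of the three conventions listed above. The class, attribute, and function names are hypothetical and chosen only for this sketch; they are not taken from the matplotlib sources.

```python
# Hypothetical names, used only to illustrate the naming conventions.

class SymbolTable:                              # classes: Upper or MixedUpper
    def __init__(self):
        self.texNames = {}                      # attributes: lower or lowerUpper

    def add_symbol(self, uninum, texname):      # functions: lower_score_separated
        self.texNames[uninum] = texname


table = SymbolTable()
table.add_symbol(0x00d7, r'\times')
```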

Also, I am not too fond of the dict of dicts -- why not use variable
names? Here is my version

    import pickle

    fname = 'stix-tbl.ascii-2005-09-24'

    uni2type1 = dict()
    type12uni = dict()
    uni2tex = dict()
    tex2uni = dict()

    for line in file(fname):
        if line[:2]!=' 0': continue # using continue avoids unnecessary indent

        uninum = line[2:6].strip().lower()
        type1name = line[12:37].strip()
        texname = line[83:110].strip()

        uninum = int(uninum, 16)
        if type1name:
            uni2type1[uninum] = type1name
            type12uni[type1name] = uninum
        if texname:
            uni2tex[uninum] = texname
            tex2uni[texname] = uninum
    pickle.dump((uni2type1, type12uni, uni2tex, tex2uni), file('unitex.pcl','w'))

    # An example
    unichar = int('00d7', 16)
    print uni2tex.get(unichar)
    print uni2type1.get(unichar)
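
For readers on Python 3, the loop above can be sketched as a function that takes any iterable of lines. The column slices are the ones used in the message; the function name `parse_stix_table` and the sample row are assumptions made for this sketch, and the row is synthetic, laid out to match the slices rather than copied from the STIX file.

```python
def parse_stix_table(lines):
    """Build the four lookup dicts from an iterable of table lines."""
    uni2type1, type12uni, uni2tex, tex2uni = {}, {}, {}, {}
    for line in lines:
        if line[:2] != ' 0':
            continue  # skip anything that is not a data row
        uninum = int(line[2:6].strip(), 16)
        type1name = line[12:37].strip()
        texname = line[83:110].strip()
        if type1name:
            uni2type1[uninum] = type1name
            type12uni[type1name] = uninum
        if texname:
            uni2tex[uninum] = texname
            tex2uni[texname] = uninum
    return uni2type1, type12uni, uni2tex, tex2uni


# Synthetic row: code point in columns 2-5, Type1 name from column 12,
# TeX name from column 83. Not real data from the STIX table.
row = ' 000d7'.ljust(12) + 'multiply'.ljust(71) + r'\times'

uni2type1, type12uni, uni2tex, tex2uni = parse_stix_table(['a header line', row])
```

With the real table, the same function could be fed an open file object, for example `parse_stix_table(open(fname))`.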

Also, I am a little hesitant to use pickle files for the final
mapping. I suggest you write a script that generates the Python code
containing the dictionaries you need (that is how much of
_mathtext_data was generated).
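
One way to read this suggestion is sketched here in Python 3: render each dict as a literal in generated source text, so the result is an ordinary importable module rather than a pickle. The helper name `render_module` and the one-entry tables are assumptions for illustration; this is not how the existing matplotlib data module was actually produced.

```python
import pprint


def render_module(dicts):
    """Return Python source text defining one dict literal per entry."""
    parts = ['# Automatically generated file; do not edit by hand.\n']
    for name in sorted(dicts):
        parts.append('%s = %s\n' % (name, pprint.pformat(dicts[name])))
    return '\n'.join(parts)


# Hypothetical one-entry tables standing in for the real ones.
source = render_module({'uni2tex': {0x00d7: r'\times'},
                        'tex2uni': {r'\times': 0x00d7}})

# Executing the generated text stands in for importing the written module.
namespace = {}
exec(source, namespace)
```

Writing `source` to a `.py` file would give a module that can be imported directly and reviewed in version control.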

Thanks,
JDH

    > Since you asked :slight_smile:
    >
    > I may not have mentioned this but the style conventions for mpl code
    > are
    >
    >   functions : lower or lower_score_separated
    >   variables and attributes : lower or lowerUpper
    >   classes : Upper or MixedUpper

OK

    > Also, I am not too fond of the dict of dicts -- why not use variable
    > names?

I used a dict of dicts because it let me generate separate
pickle files (one for each dict in the top-level dict), and
anything else (see the final script), from the corresponding top-level
dict names. I thought it was better, for practical/speed reasons, to
have a separate pickle file for every dict.

    >     for line in file(fname):
    >         if line[:2]!=' 0': continue # using continue avoids unnecessary indent

Thanks for the tip!

    >         uninum = line[2:6].strip().lower()
    >         type1name = line[12:37].strip()
    >         texname = line[83:110].strip()
    >
    >         uninum = int(uninum, 16)

I thought that the idea was to allow users to write unicode strings
directly in TeX (OK, this isn't much of an excuse :). That's why I
used the eval approach, to get the dict keys (or values) to be unicode
strings. I'm also aware that indexing by ints is faster, and that the
underlying FT2 functions work with ints... OK, I'm now convinced that
your approach is better :slight_smile:
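
The two keying schemes discussed here are one built-in call apart, which is part of why integer keys cost little. A small Python 3 sketch follows (in the Python 2 of this thread the equivalents would be `u'...'` literals and `unichr`); the helper `to_tex` and the one-entry table are hypothetical.

```python
unichar = '\u00d7'        # MULTIPLICATION SIGN as a one-character string
uninum = ord(unichar)     # character -> integer code point
roundtrip = chr(uninum)   # integer code point -> character

# Hypothetical one-entry table keyed by int, as in the parsing loop.
uni2tex = {0x00d7: r'\times'}


def to_tex(text, table):
    """Replace each character that has a table entry; leave the rest alone."""
    return ''.join(table.get(ord(ch), ch) for ch in text)


converted = to_tex('a \u00d7 b', uni2tex)
```

So a user-facing unicode string can still be accepted while the tables themselves stay keyed by code point.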

    >     pickle.dump((uni2type1, type12uni, uni2tex, tex2uni), file('unitex.pcl','w'))
    >
    >     # An example
    >     unichar = int('00d7', 16)
    >     print uni2tex.get(unichar)
    >     print uni2type1.get(unichar)
    >
    > Also, I am a little hesitant to use pickle files for the final
    > mapping. I suggest you write a script that generates the Python code
    > containing the dictionaries you need (that is how much of
    > _mathtext_data was generated).

The reason why I used pickle - from the Python docs:

On 6/22/06, John Hunter <jdhunter@...5...> wrote:

Strings can easily be written to and read from a file. Numbers take a
bit more effort, since the read() method only returns strings, which
will have to be passed to a function like int(), which takes a string
like '123' and returns its numeric value 123. However, when you want
to save more complex data types like lists, dictionaries, or class
instances, things get a lot more complicated.

Rather than have users be constantly writing and debugging code to
save complicated data types, Python provides a standard module called
pickle. This is an amazing module that can take almost any Python
object (even some forms of Python code!), and convert it to a string
representation; this process is called pickling. Reconstructing the
object from the string representation is called unpickling. Between
pickling and unpickling, the string representing the object may have
been stored in a file or data, or sent over a network connection to
some distant machine.
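
The pickling and unpickling described in the quoted passage can be sketched in Python 3 as a round trip. Two assumptions to note: the one-entry table is made up, and an in-memory `io.BytesIO` buffer stands in for a file on disk. In Python 3 the pickled form is `bytes`, so a real file would need the `'wb'` and `'rb'` modes.

```python
import io
import pickle

uni2tex = {0x00d7: r'\times'}   # hypothetical one-entry table

# In-memory round trip: dumps() returns bytes in Python 3.
blob = pickle.dumps(uni2tex)
restored = pickle.loads(blob)

# File-like round trip; the buffer plays the role of a file opened in binary mode.
buffer = io.BytesIO()
pickle.dump(uni2tex, buffer)
buffer.seek(0)
from_file = pickle.load(buffer)
```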

So I thought that pickling was the obvious way to go. And, of course,
unpickling with cPickle is very fast. I also think that no human being
should change the automatically generated dicts. Rather, we should add
a separate Python file (e.g. _mathtext_manual_data.py) where anybody
who wants to manually override the automatically generated values, or
add new (key, value) pairs, can do so.

The idea:

_mathtext_manual_data.py:

    uni2tex = {key1:value1, key2:value2}
    tex2uni = {}
    uni2type1 = {}
    type12uni = {}

uni2tex.py:

    from cPickle import load

    uni2tex = load(open('uni2tex.pcl'))
    try:
        import _mathtext_manual_data
        uni2tex.update(_mathtext_manual_data.uni2tex)
    except (TypeError, SyntaxError): # Just these exceptions should be raised
        raise
    except: # All other exceptions should be silent
        pass
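
The override idea can also be sketched in Python 3 with a narrower failure policy than the one above. As an assumption of this sketch, only a missing manual module is ignored, and every other error propagates instead of being silenced by a bare `except`. The helper name `apply_overrides`, the sample entries, and the `\mytimes` override are hypothetical.

```python
import importlib
import sys
import types


def apply_overrides(table, name, module_name='_mathtext_manual_data'):
    """Update `table` in place from `module_name.name`, if that module exists."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return table                  # no manual data file: keep generated values
    table.update(getattr(module, name, {}))
    return table


# Stand-in for a hand-written _mathtext_manual_data.py, registered in memory
# so the sketch runs without creating a file.
manual = types.ModuleType('_mathtext_manual_data')
manual.uni2tex = {0x00d7: r'\mytimes'}    # hypothetical manual override
sys.modules['_mathtext_manual_data'] = manual

generated = {0x00d7: r'\times', 0x00f7: r'\div'}
uni2tex = apply_overrides(generated, 'uni2tex')
```

Catching only `ImportError` means a typo inside the manual file shows up as an error instead of silently dropping the overrides.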

Finally, I added lines for automatically generating pretty much
everything that can be automatically generated.

stix-tbl2py.py:

'''A script for seamlessly copying the data from the stix-tbl.ascii*
file to a set of Python dicts. The dicts are then pickled to the
corresponding files for later retrieval.
Currently used table file:
http://www.ams.org/STIX/bnb/stix-tbl.ascii-2005-09-24
'''

import pickle

tablefilename = 'stix-tbl.ascii-2005-09-24'
dictnames = ['uni2type1', 'type12uni', 'uni2tex', 'tex2uni']
dicts = {}
# initialize the dicts
for name in dictnames:
    dicts[name] = {}

for line in file(tablefilename):
    if line[:2]!=' 0': continue
    uninum = int(line[2:6].strip().lower(), 16)
    type1name = line[12:37].strip()
    texname = line[83:110].strip()
    if type1name:
        dicts['uni2type1'][uninum] = type1name
        dicts['type12uni'][type1name] = uninum
    if texname:
        dicts['uni2tex'][uninum] = texname
        dicts['tex2uni'][texname] = uninum

template = '''# Automatically generated file.
from cPickle import load

%(name)s = load(open('%(name)s.pcl'))
try:
    import _mathtext_manual_data
    %(name)s.update(_mathtext_manual_data.%(name)s)
except (TypeError, SyntaxError): # Just these exceptions should be raised
    raise
except: # All other exceptions should be silent
    pass
'''

# pickling the dicts to corresponding .pcl files
# automatically generating .py module files, used by importers
for name in dictnames:
    pickle.dump(dicts[name], open(name + '.pcl','w'))
    file(name + '.py','w').write(template%{'name':name})

# An example
from uni2tex import uni2tex
from uni2type1 import uni2type1

unichar = u'\u00d7'
uninum = ord(unichar)
print uni2tex[uninum]
print uni2type1[uninum]

Cheers,
Edin