
# LaTeX2MathML

Posted by Ed Leaver on

LaTeX2MathML is a Perl script that combines LaTeX, LaTeX2HTML, and MathToWeb to produce HTML with embedded MathML from a LaTeX source document. Because it is based on LaTeX2HTML, LaTeX2MathML handles LaTeX equation labels, \ref, \cite, and \include commands, and works with multi-node document splits. The script restores the original LaTeX math expressions in the HTML output of LaTeX2HTML, and MathToWeb then converts them to MathML. The steps are:

• Run PdfLaTeX (via latexmk -dvi) on the original LaTeX document to generate the dvi and aux files required by LaTeX2HTML.
• Run PdfLaTeX (via latexmk -pdf) on the original LaTeX document to generate a PDF file of the document.
• Run LaTeX2HTML on the original document, dvi, and aux files to generate a complete HTML page with embedded bitmap equation images. LaTeX2HTML retains (some of) the original LaTeX math expression in each <IMG> element's ALT field. It also saves the complete original LaTeX math expression in a <!-- MATH (math expression) --> comment somewhere not too far above the <IMG> element that displays it.
• Using the ALT field and comment, substitute the original LaTeX expressions back into LaTeX2HTML's html files in place of the image elements.
• Run MathToWeb on the resulting ersatz html-with-embedded-latex-math to translate the LaTeX math expressions to MathML. This step is optional, because as discussed above one could also use LaTeXMathML or MathJax or MathML-Chrome JavaScript to do the translation on the client side.
• Optionally insert CSS and/or JavaScript and/or PHP <link>s and <script>s into the HTML files' <head> elements, replacing the file suffix with .php or .pl as appropriate. Note: LaTeX2HTML recognizes many Hypertext Extensions to LaTeX that do this sort of thing and much more: see Chapter 4 of the LaTeX2HTML Manual.
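
The substitution step can be sketched in a few lines. This is a minimal Python illustration, not the Perl code in LaTeX2MathML itself: the names `squash` and `restore_math` are hypothetical, and the sketch assumes that each `<!-- MATH ... -->` comment precedes its `<IMG>` and that the ALT text is a fragment of the commented expression, possibly with whitespace dropped.

```python
import html
import re

# Assumed shapes of the LaTeX2HTML output; real pages may differ in detail.
MATH_COMMENT = re.compile(r"<!--\s*MATH\s+(.*?)\s*-->", re.DOTALL)
IMG_TAG = re.compile(r'<IMG\b[^>]*?\bALT="([^"]*)"[^>]*>', re.IGNORECASE)


def squash(text):
    """Remove all whitespace so reflowed fragments still compare equal."""
    return re.sub(r"\s+", "", text)


def restore_math(page):
    """Replace each math <IMG> with the LaTeX from the nearest matching comment."""
    comments = [(m.end(), m.group(1)) for m in MATH_COMMENT.finditer(page)]

    def replace(img):
        fragment = squash(html.unescape(img.group(1)))
        # Walk backwards so the closest preceding comment wins.
        for end, expr in reversed(comments):
            if end <= img.start() and fragment and fragment in squash(expr):
                return expr
        return img.group(0)  # no matching comment: keep the image as it is

    return IMG_TAG.sub(replace, page)
```

Comparing with all whitespace removed is one way to cope with differences in spacing between the ALT text and the comment. An image with no matching comment, such as an ordinary figure, is left untouched.
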
LaTeX2MathML could be simpler. LaTeX2HTML retains only 160-character subsets of the original LaTeX math expression in each <IMG> element's ALT field, which is a bit inconvenient and apparently a holdover from the early DB and DBM databases that LaTeX2HTML uses to store intermediate data. It also removes some redundant whitespace, which makes pattern matching a bit trickier. Today GDBM is widely and freely available and has no restrictions on the length of its value fields. I haven't yet figured out how to make the necessary changes in LaTeX2HTML. Anyone making such an enhancement to LaTeX2HTML should probably consider wrapping the HTML image directly in another descriptive tag, for instance
 <LaTeX><IMG src="img5282.png" ALT="$$E = mc^2$$"/></LaTeX> 
The present
 <SPAN CLASS="MATH"><IMG src="img5282.png" ALT="$$E = mc^2$$"/></SPAN>
is quite useful and would probably suffice if it were applied everywhere (it is not always wrapped around simple inline math expressions). One might also consider an option to forgo <IMG> tags for LaTeX math expressions and just write the expression instead of the <IMG> (which is what LaTeX2MathML obtains in a roundabout way). Of course, an option to run MathToWeb on the expression at that time and embed MathML instead of the <IMG> would obviate LaTeX2MathML entirely.

An alternative approach, using the current 1.71 version unmodified, would be to run LaTeX2HTML in its debug mode, which retains its database files for posterity. The LaTeX2MathML script could probably then open the relevant DBM file and extract the information needed to correlate each <IMG> element's SRC field bitmap image name with the corresponding LaTeX expression retained in LaTeX2HTML's images.tex file. That would probably be more reliable than the present method of searching the <!-- MATH ... --> comments for the expression fragments stored in the ALT field, which could conceivably fail if the trailing ALT fragment were too short to be unique. Always something more to do...
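
The uniqueness worry can at least be detected. Below is a rough Python sketch, not part of LaTeX2MathML: the helper name `candidates` is hypothetical, and the comment pattern is my assumption about what LaTeX2HTML writes. It lists every distinct commented expression that contains a given ALT fragment, so a result with more than one entry flags an ambiguous match.

```python
import re

# Assumed shape of the comments LaTeX2HTML leaves above each math image.
MATH_COMMENT = re.compile(r"<!--\s*MATH\s+(.*?)\s*-->", re.DOTALL)


def squash(text):
    """Remove all whitespace before comparing."""
    return re.sub(r"\s+", "", text)


def candidates(fragment, page):
    """Return the distinct MATH-comment expressions containing the fragment."""
    key = squash(fragment)
    found = []
    for match in MATH_COMMENT.finditer(page):
        expr = match.group(1)
        if key in squash(expr) and expr not in found:
            found.append(expr)
    return found
```

A fragment that yields exactly one candidate is safe to substitute. Anything else is worth a warning rather than a silent guess.
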

1. Equation labels are handled fine. Citation references in text are handled fine. Citation references inside a math environment are not. I work around these cases with judicious use of \latex{} and \html{} commands, e.g.

 \label{eq4a} t = r\sqrt{\frac{N-2}{1-r^2}} \;\;\; \latex{\cite[eq. (13.7.5) ff.]{NumRecipes}} \html{(see \cite[eq. (13.7.5) ff.]{NumRecipes})} is distributed... 

2. LaTeX (pdflatex) has been hacked over for close to thirty years. It is efficient C code that works extremely well. And LaTeX2HTML has been around the block a few times itself. My LaTeX2MathML script relies upon MathToWeb, which is a relatively recent Java package. It still has some quirks. However, its error reporting is generally good, and its author (Paul Hunter) is responsive to bug reports and suggestions. The issues described here are for information, not criticism.
3. MathToWeb 3.0.2 doesn't yet recognize LaTeX font commands embedded in math expressions, such as \rm, \tiny, and \small, and prints them verbatim. A workaround is to disable the size commands, e.g.
 \renewcommand{\tiny}{ }  (note the space within the second pair of braces), and to place any text you want romanized in a math expression inside an \mbox{} or \mathrm{}.
4. MathToWeb 3.0.2 requires that no whitespace separate the two arguments of \frac:
 \frac{expr1}{expr2}  not \frac{expr1} {expr2}
5. And sometimes MathToWeb 3.0.2 just fails. For example

 \left[ \sum_1^N \bar{X} \right]^2 

produces an obscure "(mo_mrow) A MathML delimiter is not closed correctly at line: unknown" error. I hacked around this one with

 \left[ \sum_1^N \bar{X} \right]\;^2 

which works but doesn't properly place the final exponent. (Again just FYI. Paul promises a fix to all these issues in the next major release.)

6. Even after disabling the 1.5 second sleeps in its driver, MathToWeb is sluggish compared to LaTeX. A document that LaTeX (through latexmk) can process to PDF in less than a second on a fast machine can take MathToWeb a hundred times longer just to process the LaTeX math expressions to MathML. This is just a matter of convenience. So far I've been impressed that MathToWeb works as well as it does, and I expect speed to improve as part of the normal development process. Paul thinks he'll have version 4.0 out sometime this fall. I'll update this page when he does.
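
The \frac whitespace restriction in item 4 is easy to work around mechanically before the expressions reach MathToWeb. Here is a minimal Python sketch of one way to do it; the function name `tighten_frac` is hypothetical, and it only handles the braced form `\frac{...} {...}` (not `\frac {a}{b}` or `\frac12`). A small brace stack is used instead of a single regular expression so that nested fractions are handled too.

```python
def tighten_frac(tex):
    # Delete whitespace between the first and second brace group of each \frac.
    out = []
    stack = []  # one flag per open brace: does it open \frac's first argument?
    i = 0
    while i < len(tex):
        c = tex[i]
        if c == "\\":
            # Copy a backslash and the next character as one unit, so that
            # escaped braces such as \{ and \} are never counted.
            out.append(tex[i:i + 2])
            i += 2
            continue
        out.append(c)
        if c == "{":
            stack.append(tex[max(0, i - 5):i] == "\\frac")
        elif c == "}":
            is_first_arg = stack.pop() if stack else False
            if is_first_arg:
                j = i + 1
                while j < len(tex) and tex[j].isspace():
                    j += 1
                if j < len(tex) and tex[j] == "{":
                    i = j  # jump over the whitespace to the second argument
                    continue
        i += 1
    return "".join(out)


print(tighten_frac(r"t = r\sqrt{\frac{N-2} {1-r^2}}"))
```

Whitespace after any other brace group is left alone, so constructs such as `\sqrt{a} {b}` pass through unchanged.
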