Translating LaTeX Documents to HTML/MathML

Posted by Ed Leaver on

In a previous incarnation, back in the early pre-DNS daze of the Internet, I dabbled in some moderately obscure mathematics of my own, eventually publishing a few modestly received journal articles. These were prepared using the LaTeX document preparation program, which I found rather convenient for the purpose. I was hardly alone, and since then LaTeX has become the common standard for publishing research articles and textbooks involving mathematics.

A few years later, longtime colleagues Nelson Beebe and Pieter Bowman and I attended a public seminar conducted by Prof. Larry Smarr -- no mean math hack himself -- wherein he introduced some of the fascinating work he and his staff at the nascent National Center for Supercomputing Applications, in cooperation with accomplices at CERN, were conducting on interactive graphical-based Internet communication, the groundwork of what was to become the World Wide Web.

It was exciting stuff, and as we departed the auditorium, Pieter -- a sysadm with the local Math Department -- quietly voiced a common concern: "Yes. But where does one get the bandwidth?"

The requisite bandwidth soon materialized, and with it HTML and near-realtime downloads of PDF documents, music rips, and all sorts of fascinating visual imagery. But direct display of mathematical expressions in HTML? The stuff that was the heart and soul of nearly every one of the Internet and World Wide Web's collective founders? Not so much. One could download all the, ah, images one's little heart desired, legal or otherwise, but to display a mathematical article required the formulae first be rendered into bitmapped images, usually at low resolution and just as coarse and grainy. Donald Knuth must have laughed. Sadly perhaps, but laughed nonetheless. However...

LaTeX2HTML has been available for at least the past fifteen years and does a fairly decent job at translating even modestly complex LaTeX documents to HTML. It famously displays math expressions as bitmaps.

TeX4HT works somewhat similarly to LaTeX2HTML, but differs in philosophy. While LaTeX2HTML recognizes a rich set of macros from the html package, TeX4HT encourages making no modifications to the author's original TeX or LaTeX source files in favor of adding custom functionality via configuration files. However, source file LaTeX macros are available via the tex4ht package should one desire them. I haven't used TeX4HT much yet, but so far I have needed neither custom configuration files nor the package macros. TeX4HT has the powerful attraction of optionally generating MathML markup instead of embedded bitmap <IMG> tags. This actually works quite well, though (as always) there are a few nits:

  1. TeX4HT (mzlatex) drops subscripts and superscripts from LaTeX \newcommand macros.
  2. I haven't yet been able to figure out how to tell mzlatex not to change the image files specified in an \includegraphics macro.
Both are easily fixed in the LaTeX4HT driver script by scanning \newcommands and \renewcommands from the LaTeX input stream and applying them prior to calling mzlatex, and concurrently translating \includegraphics to \Picture. Neither is a big issue.
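
As a rough illustration of that kind of preprocessing, here is a minimal Python sketch. It is not the LaTeX4HT driver script, which isn't reproduced on this page; every function name is hypothetical. It assumes only argument-free \newcommand and \renewcommand definitions need expanding, that definitions do not reference one another, and that the \Picture[alt]{file} form is what TeX4HT should receive.

```python
import re

# Matches \newcommand or \renewcommand (optionally starred) followed by a
# macro name and then directly by the opening brace of the body. Macros that
# declare arguments, e.g. \newcommand{\v}[1]{...}, do not match and are left
# untouched.
DEF_RE = re.compile(r"\\(?:re)?newcommand\*?\s*\{?(\\[A-Za-z]+)\}?\s*(?=\{)")

# Matches \includegraphics with an optional [options] group.
GRAPHICS_RE = re.compile(r"\\includegraphics\s*(?:\[[^\]]*\])?\s*\{([^{}]*)\}")


def read_group(text, start):
    """Return (content, index just past '}') for the {...} group at text[start].

    Escaped braces such as \\{ are not treated specially in this sketch.
    """
    if text[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start + 1:i], i + 1
    raise ValueError("unbalanced braces")


def expand_simple_macros(source):
    """Remove argument-free macro definitions and expand their uses inline."""
    macros = {}
    pieces = []
    pos = 0
    while True:
        m = DEF_RE.search(source, pos)
        if m is None:
            pieces.append(source[pos:])
            break
        body, end = read_group(source, m.end())
        macros[m.group(1)] = body
        pieces.append(source[pos:m.start()])  # keep text before the definition
        pos = end                             # skip the definition itself
    text = "".join(pieces)
    for name, body in macros.items():
        # (?![A-Za-z]) stops \Rs from matching inside a longer name like \Rsx.
        # The body is wrapped in braces so that x^\Rs becomes x^{R_s}.
        pattern = re.escape(name) + r"(?![A-Za-z])"
        text = re.sub(pattern, lambda _m, b=body: "{" + b + "}", text)
    return text


def to_picture(source):
    """Rewrite \\includegraphics[...]{file} as \\Picture[file]{file}."""
    return GRAPHICS_RE.sub(
        lambda m: "\\Picture[" + m.group(1) + "]{" + m.group(1) + "}", source
    )


def preprocess(source):
    """Apply both fixes before the text is handed to mzlatex."""
    return to_picture(expand_simple_macros(source))
```

The output of preprocess would be written to a temporary file and passed to mzlatex in place of the original. Using the file name as the \Picture alternate text is just a placeholder choice.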

LaTeXMathML is a JavaScript program that translates (simple) LaTeX math expressions embedded in (X)HTML to MathML, though it has difficulty with complicated expressions and with more extended non-mathematical portions of LaTeX documents. For instance, the present version appears to have some difficulty with tabular and verbatim environments. If <LaTeX> ... </LaTeX> tags are present, LaTeXMathML processes only those portions of an HTML document that they enclose, and these may include not just the LaTeX math environments themselves, but also portions of the document that reference them through LaTeX \label and \ref commands. Although limited in the expressions it can correctly convert, LaTeXMathML can also do simple sectioning - see its user's guide.
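
The region-selection rule just described can be sketched in a few lines of Python. This only illustrates the rule as I read it, not LaTeXMathML's own JavaScript; the function name is hypothetical, and matching the tag exactly as spelled above is an assumption.

```python
import re

# Non-greedy match so that each <LaTeX> ... </LaTeX> pair is its own region.
LATEX_TAG_RE = re.compile(r"<LaTeX>(.*?)</LaTeX>", re.DOTALL)


def regions_to_translate(page):
    """Return the parts of an HTML page that would be scanned for LaTeX.

    If any <LaTeX> ... </LaTeX> pairs are present, only their contents are
    returned; otherwise the whole page is the single region.
    """
    regions = LATEX_TAG_RE.findall(page)
    return regions if regions else [page]
```

The practical consequence is that a stray dollar sign elsewhere on a tagged page (a price, say) is never mistaken for a math delimiter.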

MathJax is another client-side JavaScript package set up for easy hosting by web servers, and is also available as a plugin for Google's Chromium browser (MathML-Chrome: see below). MathJax is a combined LaTeX math expression translator and MathML rendering script, enabling web pages that embed either LaTeX or MathML to be displayed on most browsers at the cost of some download and rendering time when the web page is accessed. It is also available as a plugin for blog server packages such as WordPress and Blogspot/Blogger. MathJax is well supported, and probably the closest there is to "official" LaTeX and MathML support for most browsers. MathJax communicates with a font server to ensure it has math fonts needed for rendering. In contrast, although Firefox supports MathML natively, font installation is ultimately up to the user. From Fonts for Mozilla's MathML engine:

"...on two similar systems, while one user may see a document rendered correctly, the other user may see something completely different. To see MathML as intended, you need sufficient font support, which may mean installing some fonts. Just having a MathML-enabled browser is not necessarily enough."
See About Your Browser. Until there is more general MathML support in the popular web browsers, with needed fonts included in their general distribution, MathJax may be the best compromise to get the most math to the widest audience.

MathML-Chrome. Google's Chromium browser has an available "Math Anywhere" extension based on MathJax that, like LaTeXMathML, does client-side interpretation of LaTeX math expressions to MathML, and also renders already-present MathML to the browser screen. I haven't used it much, but at least at rendering already-present MathML, MathML-Chrome does not appear quite as robust against minor MathML errors as Firefox's built-in MathML interpreter. (To be updated after more experience with MathJax and MathML-Chrome.)

MathToWeb is a very useful Java program that translates even complicated LaTeX math expressions (embedded in HTML) to MathML directly as part of one's document development flow. Not by the web server or on the web browser client, but rather in the safety and comfort of one's own desktop, before one even uploads the document to the web. MathToWeb 3.0.2 purports to recognize LaTeX inline math constructs and all modern AMS-TeX displaymath environments -- although somewhat incongruously not the archaic LaTeX displaymath and eqnarray environments themselves. (To be fixed in version 4.0.) Like TeX4HT, MathToWeb requires browser support for MathML and its fonts. Unlike TeX4HT, MathToWeb operates just on the LaTeX math expressions in an otherwise-html document, so is well suited for authors who prefer to write native html but express their mathematics in LaTeX.

LaTeX2MathML is a Perl script combining LaTeX, LaTeX2HTML, and MathToWeb. Based on LaTeX2HTML, LaTeX2MathML translates an entire LaTeX document, handling equation labels, \ref, \cite, and \include commands, and works with multi-node document splits. The script first regenerates the original LaTeX math expressions in the LaTeX2HTML html output. These are then converted to MathML by MathToWeb. I've used it to prepare several of the articles on this site.
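
To give a feel for the "regenerate the math" step, here is a minimal Python sketch. It is not the LaTeX2MathML Perl script. It rests on an assumption not stated above: that the LaTeX source of each formula can be recovered from the ALT attribute of the <IMG> tag that LaTeX2HTML emits for it. The function name is hypothetical.

```python
import html
import re

# An <IMG ...> tag with an ALT="..." attribute; group 1 is the ALT text.
IMG_RE = re.compile(r'<IMG\b[^>]*?\bALT="([^"]*)"[^>]*>',
                    re.IGNORECASE | re.DOTALL)


def restore_math(page):
    """Replace math bitmaps with the LaTeX source held in their ALT text.

    Images whose ALT text does not look like math (no leading '$' or
    '\\begin{') are left untouched. HTML entities in the ALT text are
    decoded on the assumption that the next tool expects plain LaTeX.
    """
    def swap(match):
        alt = html.unescape(match.group(1)).strip()
        looks_like_math = alt.startswith("$") or alt.startswith("\\begin{")
        return alt if looks_like_math else match.group(0)

    return IMG_RE.sub(swap, page)
```

After this pass the HTML file once again contains LaTeX math inline, which is the form a translator such as MathToWeb expects as input.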

More recently, I've written a similar LaTeX4HT script based on TeX4HT, which so far promises to require far fewer html-specific tweaks to the LaTeX source document files than does LaTeX2MathML. Also, since TeX4HT produces MathML output directly, the LaTeX4HT script itself is much simpler and easier to maintain. I anticipate using this TeX4HT-based script extensively, and will update this section as LaTeX4HT matures.

To conclude, I've outlined a few of the methods available to render LaTeX mathematical expressions in HTML documents. Content is by far the most important consideration. Documents are created by one or more authors to convey their ideas, and if those ideas involve mathematical expressions, the choice of rendering tool will be driven by convenience to the author(s) involved. Many prefer a graphical document preparation program such as LibreOffice, TeXMacs, or LyX. Others might prefer a simple HTML editor, wherein they write their HTML document, scan it with a browser, and are done with it without further pre-processing. For these, the client-side MathJax or LaTeXMathML JavaScript math translators might be attractive, and particularly suited for bloggers who like to rapidly write and enter their posts directly into the simple online editors provided by their host site's environment, such as WordPress. For such, time and ease of authorship are critical, and there is much to be said for the convenience of the content source management automatically provided by the server-side document database.

MathToWeb is well suited for documents written in native html with embedded LaTeX math expressions, if the goal is to translate those expressions to MathML as part of the authoring process. Browsers other than Firefox will still need a client-side plugin or script (MathPlayer or MathJax) to display the MathML. MathToWeb is actively supported.

I personally prefer to prepare much of my content as full LaTeX documents intended for processing by pdflatex anyway. For such, running LaTeX2MathML or LaTeX4HT is no more difficult than running pdflatex by itself; each affords considerable flexibility in preparing HTML output from the same LaTeX source files. It depends on the document, its content, and how much formality one wishes to put into its preparation. (The page you are reading, for instance, is written in plain HTML.) Here is an example illustrating very simple math. More complicated math expressions may be found in the Optimized Cross-Correlation Spectral Matching article in this site's Projects section. Its intermediate LaTeX2HTML bitmapped image html file is provided for comparison. Here one can see where LaTeX2HTML completely mis-rendered eqs 26, 27, and 31. The MathToWeb engine used by LaTeX2MathML does a far better job and its MathML output, as rendered by Firefox, is pleasant to read.