## Hermes - a semantic XML+MathML+Unicode e-publishing/self-archiving tool for LaTeX authored scientific articles

Last updated by Romeo Anghelache
Users, developers and/or philosophers are invited to ask/send comments on the Hermes blog or to send their comments directly to the author.

### Examples

Some results of Hermes assisted conversions are hosted here; the source distribution also contains an article in LaTeX source, as well as a content-oriented source sample.

### What is Hermes?

Hermes is a grammar-based translator from (AMS)LaTeX to Unicode (UTF-8) encoded XML+MathML+metadata. It is free software (software libre).
Hermes does not yet support translating pure (AMS)TeX documents; this facility will be added sooner or later, depending on user interest.

### What for?

Hermes is here to help individuals with self-archiving, libraries with long-term archiving, and publishers with having a reference document for their various specific services.

### How does it work?

Hermes follows the steps below, in the specified order:
1. semantically seeds a copy of your TeX source
2. lets the TeX program do its job (texing) on this semantically enriched source
3. parses the resulting semantic dvi
4. generates the XML reference document, a semantic XML reflection of your TeX source.
It works on Linux, Windows and OS X.

### What is the Hermes reference document?

It is a Unicode XML document with a generic structure, containing free text and various XML vocabularies.
It contains the semantics Hermes managed to recover from the LaTeX source.
Its validating XML Schema will be published once this generic structure becomes less fluid.
Currently, the generic structure consists of:
1. sections,
2. presentation hints (currently font names and sizes),
3. free text ((accented) TeX glyphs mapped to their Unicode equivalents),
4. metadata (title, author, date etc.),
5. bibliography,
6. internal and external references (no need for special LaTeX packages to get these activated in the XML),
7. tables and images.
These items are in a one-to-one relationship with the corresponding structures in the source/semantic dvi. This list is extensible: LaTeX environments automatically produce an XML structure.
The XML vocabularies reflect the vocabularies used in the LaTeX source, e.g. mathematical regions in the LaTeX source correspond to MathML regions in the reference document.
MathML is the only validatable XML vocabulary currently implemented and supported by Hermes (SVG and other vocabularies, like MARC, or other open standards, may follow if users are interested).
When Hermes translates legacy LaTeX files without manual intervention on the source, only MathML-presentation is generated (here, legacy LaTeX files are sources which were not edited with semantic vocabularies in mind).
MathML-content can only be generated if a newly authored LaTeX source uses the semantic LaTeX macros available in the Hermes distribution.
• the automatic generation of MathML-presentation is possible only if the LaTeX math expressions are well-formed, that is, made of balanced expressions (paired delimiters); this should not be an issue, because typing mathematics in LaTeX is a commitment to a controlled vocabulary anyway;
• use of the '\frac' macro is encouraged over the '\over' macro (however, the seed utility delimits the regions covered by the '\over' macro, bringing it closer to the effect of a '\frac' macro).
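As an illustration of the two points above, the fragment below contrasts the encouraged and discouraged fraction forms and shows a balanced delimiter pair. The expressions are made up for this sketch and are not taken from the Hermes distribution.

```latex
% Preferred: numerator and denominator are explicit, delimited arguments.
\[ \frac{a+b}{c+d} \]

% Discouraged: \over splits the whole enclosing group, so its extent
% is only implied by the surrounding braces.
\[ {a+b \over c+d} \]

% Well-formed: every \left has a matching \right.
\[ \left( \frac{x}{y} \right)^{2} \]
```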

### Installation requirements

To compile the program and get the proper example output, you need a standard LaTeX system, gcc, bison, flex, make and libxml/xslt on your system. Windows developers can check out the Cygwin distribution; Windows users will have a binary distribution (hermes.exe and seed.exe) issued (almost) synchronously with the source distribution.
Developers and Unix users can unpack the source distro and run make.
After a successful 'make' you get:
• hermes and seed binaries;
• content.s.dvi - the semantic dvi result of a latex run on content.s.tex, which, in turn, is produced from content.tex by seed;
• content.xml - the reference document (XML+MathML-content) obtained by using the Hermes semantic TeX macros;
• content.pub.xml - a renderable transformation of the reference document, as an XHTML file with embedded MathML content;
• the Hermes stylesheet, pub.xslt, is used in this transformation, but you can use your own for different results/looks;
• the same goes for the other example file, article.tex: you will get article.pub.xml, the renderable instance of article.lib.xml, the reference document Hermes generated from the source.
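Before running make, it can help to confirm that the command-line tools named in the requirements are on your PATH. The helper below is a minimal sketch, not part of the Hermes distribution: the function name missing_tools is hypothetical, and xsltproc is used as a stand-in for the libxml/xslt requirement (it cannot tell whether development headers are installed).

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not on PATH.
# Returns 0 when every tool was found, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}

# Example (run it from the unpacked source directory):
#   missing_tools latex gcc bison flex make xsltproc && make
```

Any name the function prints is a tool to install before trying the build again.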

### General use

1. write an (AMS)LaTeX text containing mathematical expressions; LaTeX it and fix all your editing errors ;)
2. run latex document.tex; if you didn't get a dvi, return to step 1
Use Hermes to get the reference document (library) and renderable (publish) XML files:
1. run ./seed document.tex; if you didn't get document.s.tex, go to found-a-bug
2. run latex document.s.tex; if you didn't get document.s.dvi, go to found-a-bug
3. run ./hermes document.s.dvi > document.lib.xml; if you didn't get document.lib.xml, go to found-a-bug
4. run xsltproc pub.xslt document.lib.xml > document.pub.xml; if you didn't get document.pub.xml, go to found-a-bug
5. now you can archive document.lib.xml or send it to your library, and post document.pub.xml on your website, along with the MathML stylesheets, for others to read/reuse.
found-a-bug:
either let the author know, fix it or ask around.
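The Hermes steps above can be strung together in a small script. This is a sketch under stated assumptions: seed, hermes and pub.xslt are in the current directory, latex and xsltproc are on your PATH, and the plain latex run on the original source has already succeeded. The function names hermes_convert and need are hypothetical, and the -interaction=nonstopmode flag is added only so that latex does not stop to prompt on errors.

```shell
#!/bin/sh
# Hypothetical helper: fail with a "found-a-bug" message when an
# expected output file is missing or empty.
need() {
    [ -s "$1" ] && return 0
    echo "found-a-bug: $1 was not produced" >&2
    return 1
}

# Hypothetical wrapper: convert NAME.tex into NAME.lib.xml and NAME.pub.xml.
# Assumes ./seed, ./hermes and pub.xslt live in the current directory.
hermes_convert() {
    doc=$1    # base name, without the .tex suffix
    if [ ! -f "$doc.tex" ]; then
        echo "no such source file: $doc.tex" >&2
        return 2
    fi
    # Remove stale outputs so that leftovers cannot mask a failed step.
    rm -f "$doc.s.tex" "$doc.s.dvi" "$doc.lib.xml" "$doc.pub.xml"

    ./seed "$doc.tex"
    need "$doc.s.tex" || return 1

    latex -interaction=nonstopmode "$doc.s.tex"
    need "$doc.s.dvi" || return 1

    ./hermes "$doc.s.dvi" > "$doc.lib.xml"
    need "$doc.lib.xml" || return 1

    xsltproc pub.xslt "$doc.lib.xml" > "$doc.pub.xml"
    need "$doc.pub.xml" || return 1
}

# Example: hermes_convert document
```

Checking for each output file, instead of relying on exit codes alone, follows the "if you didn't get ..." wording of the steps.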

### Architecture of Hermes

• a set of helper (La)TeX macros (the 'dlt.tex' file),
• a scanner, written for flex, tokenizing the semantic dvi file,
• a parser, written for bison; the grammar generates the XML output.

### Developer's tips

• does not replace nor modify the functionality of the TeX engine, so it should not restrict the set of macros used for authoring: it uses the dvi format as its input (it relies on the transparency of the TeX '\special' command).
• does NOT make inferences for MathML-content; instead, Hermes provides a set of LaTeX authoring macros (called Hermes semantic macros) that enable an author to write mathematical expressions covered by the MathML-content standard (not tested extensively).
• (almost) preserves the presentational output of the original source documents (remember, Hermes is intended to produce a document with semantics equivalent to the (La)TeX source, but better suited to long-term archiving or publisher processing; the final look depends entirely on the stylesheet you use to create a renderable instance of this document).
• gives authors the freedom to semantically enhance (parts of) their original document, at their own pace: Hermes can generate a mixture of MathML-presentation and MathML-content. It is easily extendable to allow generating a mixture of other controlled vocabularies too.
• all the glyphs in the following TeX fonts are mapped into their Unicode (UTF-8 encoding) counterparts: fonts having standard names as specified in fontname, and the following fonts with non-standard names: cm.., ams, px.., tx.., ec.., tc.., ty.., euf.., [l,w]asy.
If the source document uses a font which is not in this list, Hermes dies noisily (listing the fonts which are not mapped yet) before even parsing the text. The list of supported fonts with non-standard names can easily be extended in future versions, at users' request (for any glyph which has a Unicode correspondence).
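To see in advance whether a font name falls under the non-standard prefixes listed above, a check along the following lines could be used. It is only an illustration and not how Hermes itself decides: the function names font_supported and check_fonts are hypothetical, the prefixes are copied from the list above, and fonts named according to the fontname scheme are not modelled at all.

```shell
#!/bin/sh
# Hypothetical check against the non-standard prefixes listed above.
# Fonts that follow the fontname scheme are NOT covered by this sketch.
font_supported() {
    case $1 in
        cm*|ams*|px*|tx*|ec*|tc*|ty*|euf*|lasy*|wasy*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report every unmatched font on stderr, then fail if there was at least
# one, loosely mirroring the "dies noisily" behaviour described above.
check_fonts() {
    unmapped=0
    for font in "$@"; do
        if ! font_supported "$font"; then
            echo "font not mapped: $font" >&2
            unmapped=1
        fi
    done
    [ "$unmapped" -eq 0 ]
}

# Example: check_fonts cmr10 eufm10 wasy10
```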

### To do

• test Hermes on various collections of TeX documents (arXiv)
• refine the LaTeX document structure Hermes is aware of
• refine the presentation oriented information
• check the completeness of the content oriented macros (used to generate MathML content) provided with the distribution
• add domain specific controlled vocabularies

### Credits

Hermes is covered by GNU GPL, and developed by Romeo Anghelache. It was created in the EU funded MoWGLI research project (ended in Feb. 2005), as a task for LivingReviews, from Max Planck Institute for Gravitational Physics, Golm, Germany.
Its further development was partially supported by :