Translating LaTeX documents

I am trying to recapture my Spanish after a long hiatus following four years of study in high school. As such, I decided that translating the KBD-Infinity instruction manuals would be educational as well as useful. A complete translation by hand would take too much effort and could produce some strange results. Instead, I decided to use Google Translate (GT), combined with my reading knowledge of Spanish to remove any howlers. An instruction manual is an ideal candidate for machine translation: the original English text must be straightforward, with few idioms and no subtle turns of phrase.

LaTeX is a popular markup language for typography. We use it for all our manuals. In comparison to word processors, it offers many advantages for maintaining documents. The source files are plain text, so they should be easy to upload to GT. Unfortunately, GT does not offer direct support for LaTeX. In fact, it does serious damage to the markup structure. Nonetheless, there are ways to compensate. In this article, I’ll describe an automated procedure that allows translation of an entire instruction manual, preserving all the visual structure of LaTeX, in half an hour or less.

First, a few words about using GT. For straightforward text, the translation is quite good. The service is free, and it is not necessary to be logged into Google. On the other hand, we have all come to know what “free” means on the Internet, so it would be wise not to upload documents with sensitive information. The main emphasis of GT is the translation of web pages. Accordingly, the service does a good job with the contents of HTML pages, preserving the markup structure. Ironically, GT will not allow you to upload HTML documents directly. Supposedly, you can upload a variety of other document types (.doc, .docx, .odf, .pdf, .ppt, .pptx, .ps, .rtf, .xls, or .xlsx), but the performance is spotty. I found that it refused to accept my .doc files for reasons unknown. It did accept PDF files, producing a spare text-only translation with no visual elements. The best option is to use plain text (.txt).

There are some steps to prepare a LaTeX document for translation. First, GT will not accept a .tex file. It also refuses a .tex file with the suffix changed to .txt. Use a text editor to grab the content between \begin{document} and \end{document} and paste it into a .txt file. The second issue is that GT imposes an undocumented character limit. The general suspicion on the Internet is that the limit is 5000 characters. I found (as of this date) that GT would accept somewhat more than 20,000 characters. With larger documents, GT translates the first 20,000 characters and then fills out the remaining space with the original text, issuing no warning message. In consequence, input files larger than 20 kB must be split into parts. If you are concerned about issues of privacy or performance, it may not be useful to search for alternatives. Google has made GT open to websites and applications, so many translation services are simply front ends to GT.
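
The two preparation steps above lend themselves to a short script. The following Python sketch is one possible way to do it and is not part of the workflow described in this article: it pulls out the text between \begin{document} and \end{document} and splits it at blank lines into pieces below a size limit. The function names and the 20,000-character default are assumptions for illustration; the limit is simply the value observed above, and since it is undocumented it may change.

```python
import re

# Limit observed in this article; an assumption, not a documented value.
MAX_CHARS = 20_000


def extract_body(tex_source: str) -> str:
    # Return the text between the begin{document} and end{document} markers.
    begin = r"\begin{document}"
    end = r"\end{document}"
    start = tex_source.index(begin) + len(begin)
    stop = tex_source.index(end, start)
    return tex_source[start:stop].strip("\n")


def split_into_chunks(body: str, max_chars: int = MAX_CHARS) -> list[str]:
    # Split at blank lines and pack paragraphs into chunks of at most max_chars.
    paragraphs = re.split(r"\n\s*\n", body)
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_chars is kept whole,
            # so it would still need to be split by hand.
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be saved to its own .txt file and uploaded separately. Splitting at blank lines keeps every paragraph intact, although an environment that contains blank lines may end up divided between two files; that is harmless as long as the translated pieces are joined again in order.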

The translation process is simple.

  1. Set the language preferences, click on Documents, and browse for a file to upload.
  2. The translated text appears in a text window. Press Ctrl-A then Ctrl-C to capture the output and paste it into a text file.

The aftermath is not so simple. In a long document, GT may introduce thousands of errors in the LaTeX markup structures. For example, in a Spanish translation “\normalsize” becomes “ \ talla normal”. Note the addition of extra spaces, one of the primary problems. I realized that it would take days of hand labor to correct all the errors. My first thought was to create macros in my text editor, but this proved unworkable. In a typical LaTeX document translation, there may be more than 70 different types of corrections to make. My ultimate solution was to write a utility program in which I could define any number of global replacements. This approach allowed almost instantaneous document correction.

Figure 1. ChangeIt interface.

Figure 1 shows the interface of my program ChangeIt. The critical concept is the ability to load custom template files that define multiple rules for global search and replace operations. Here’s an example of the template that evolved for my use in Spanish translations, SpanishTemplate.CTP:

* ChangeIt template
$
$\ begin {$\begin{$
$\ end {$\end{$
$\nueva pagina$\newpage$
$\Tabla de contenido$\tableofcontents$
$\ textbf {$\textbf{$
$\ textit {$\textit{$
$\ textsf {$\textsf{$
$\ texttt {$\texttt{$
$\ textsl {$\textsl{$
$\Enorme$\Huge$
$\Grande$\Large$
$\pequeña$\small$

...

$\ bigskip$\bigskip$
$enumerar$enumerate$
$ \ rightarrow $\rightarrow$
$\ footnote $\footnote$
$ ~ \ ref {$~\ref{$
$\ #$\#$
$\ _$\_$
$\ nota al pie {$\footnote{$

The first line of the text file is a required identifier, while the second line defines the delimiter. The file may contain any number of data lines, each giving a replacement rule. Each data line has the format Delim:Before:Delim:After:Delim (the colons only separate the parts in this description and do not appear in the file). For example, the first rule in the example removes extra spaces, so that “\ begin {document}” becomes “\begin{document}”. Note that ChangeIt can be used to correct any type of text file by using different templates.
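
To make the template format concrete, here is a minimal Python sketch of how such a file could be read and applied. It is not the ChangeIt source code. The function names parse_template and apply_rules are hypothetical, and details such as requiring the identifier line to start with an asterisk and skipping lines that do not fit the rule format are assumptions.

```python
def parse_template(text: str) -> list[tuple[str, str]]:
    # Line 1: identifier (assumed to start with '*'). Line 2: the delimiter.
    lines = text.splitlines()
    if len(lines) < 2 or not lines[0].startswith("*"):
        raise ValueError("missing identifier or delimiter line")
    delim = lines[1].strip()
    if len(delim) != 1:
        raise ValueError("delimiter must be a single character")
    rules = []
    for line in lines[2:]:
        parts = line.split(delim)
        # A data line Delim-Before-Delim-After-Delim splits into
        # ['', before, after, '']; anything else is skipped (assumption).
        # Spaces inside Before and After are significant and are kept.
        if len(parts) == 4 and parts[0] == "" and parts[1] and not parts[3].strip():
            rules.append((parts[1], parts[2]))
    return rules


def apply_rules(text: str, rules: list[tuple[str, str]]) -> str:
    # Apply each rule in turn, as a global replace over the whole text.
    for before, after in rules:
        text = text.replace(before, after)
    return text


# Small demonstration with two rules taken from the template above.
SAMPLE = "\n".join([
    "* ChangeIt template",
    "$",
    r"$\ begin {$\begin{$",
    r"$\ textbf {$\textbf{$",
])
demo_rules = parse_template(SAMPLE)
print(apply_rules(r"\ begin {document} \ textbf {Hola}", demo_rules))
# prints: \begin{document} \textbf{Hola}
```

Because the rules are plain string replacements applied in file order, a later rule sees the output of an earlier one, so the order of the data lines can matter.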

Once a template is loaded, ChangeIt can either fix an entry pasted in the text area or correct an entire file. The procedure for the first option is to copy text to the clipboard and then left-click inside the text area. When the Process text button is pressed, the program corrects the text, updates the area and places the result on the clipboard. In a correction, ChangeIt runs through the list of rules, applying each one in turn to the entire file. Because it’s easy to add new rules to a template as needed, the program works better the more it’s used. Typically, only a few subsequent hand corrections are necessary, so the total translation and conversion time for a 15-page manual may be only 15-30 minutes. The resulting text can then be pasted into the original LaTeX document between \begin{document} and \end{document}. To display the correct accent marks in the Spanish translation, I added one line to the header:

\usepackage[spanish]{babel}
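
The final splice can also be scripted instead of pasted by hand. The Python sketch below is an illustration under stated assumptions, not part of ChangeIt: the function name rebuild_document is hypothetical, and it assumes the original file contains exactly one \begin{document} and \end{document} pair. It replaces the body with the corrected translation and adds the babel line to the header if it is not already there.

```python
def rebuild_document(
    original_tex: str,
    translated_body: str,
    preamble_line: str = r"\usepackage[spanish]{babel}",
) -> str:
    begin = r"\begin{document}"
    end = r"\end{document}"
    start = original_tex.index(begin)
    stop = original_tex.index(end, start)

    # Everything before the begin marker is the header (preamble).
    preamble = original_tex[:start]
    if preamble_line not in preamble:
        preamble = preamble.rstrip("\n") + "\n" + preamble_line + "\n"

    # Keep the end marker and anything after it from the original file.
    tail = original_tex[stop:]
    return preamble + begin + "\n" + translated_body.strip("\n") + "\n" + tail
```

Running the function on its own output leaves the file unchanged, so it is safe to repeat after further hand corrections to the translated body.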

Figure 2 shows an output example for the translation of a LaTeX document with extensive markup. A full PDF manual translation is available for download: MIDI Doctor.

Figure 2. Example: translation of a LaTeX document to Spanish.
