Automatic glossary generation from translation memories

At a time when machine learning, neural networks and neural machine translation (NMT) systems are becoming the norm in computational linguistics, talking about translation memories and CAT tools may seem like prehistory. Yet they can still be valuable, at least for compiling and aligning the corpora on which NMT systems are trained.

But let's stick to translation memories. Would it be possible to generate glossaries from them automatically? I think it certainly is, by applying a simple statistical comparison.
Let us take the corpus I had already prepared for Marian NMT training: a table of about 40,000 rows in two columns, the first containing the segmented Latin texts and the second their Italian counterparts. Suppose we want to find the meaning of the word ultio without resorting to a Latin-Italian dictionary such as the well-known Castiglioni-Mariotti [1].

The task is to identify all the rows of the Latin column in which the word ultio appears and to extract the corresponding rows of the Italian column. We then count the frequency of every word in the extracted Italian sentences and sort the words by decreasing frequency, excluding the so-called stop words (conjunctions, prepositions, articles and so on).
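The extraction and counting step can be sketched in a few lines. This is a minimal Python illustration, not the PHP script mentioned in this post. The function name, the tiny stop-word list and the four sentence pairs are all invented for the example, and the match is on the exact word form only, so inflected forms such as ultionem would be missed.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
ITALIAN_STOP_WORDS = {"la", "il", "lo", "di", "e", "è", "per", "un", "una", "degli", "del"}


def target_word_counts(pairs, source_word, stop_words=ITALIAN_STOP_WORDS):
    """Count target-side words in the rows whose source side contains source_word."""
    needle = source_word.lower()
    counts = Counter()
    for source, target in pairs:
        # Exact word-form match on the source side (assumption: no lemmatization).
        if needle not in re.findall(r"\w+", source.lower()):
            continue
        for token in re.findall(r"\w+", target.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts


# Invented toy rows standing in for the two-column Latin-Italian table.
pairs = [
    ("ultio deorum sera est", "la vendetta degli dei è tarda"),
    ("ultio iusta fuit", "la punizione fu giusta"),
    ("ultio sequitur scelus", "la vendetta segue il delitto"),
    ("pax grata est", "la pace è gradita"),
]

counts = target_word_counts(pairs, "ultio")
print(counts.most_common(1))  # [('vendetta', 2)]
```

On a real corpus, counts.most_common() returns the whole list already sorted by decreasing frequency.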
To refine the extraction we set a frequency threshold; half the maximum frequency can be a good compromise. For example, if the most frequent of 10 extracted words appears 100 times, all the words with a frequency lower than 50 are discarded as the least likely matches. The word or words with a frequency of 50 or higher are instead the most likely translations of our Latin word.
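The threshold rule can be sketched as follows. Again this is a hedged Python illustration rather than the original PHP code: the function name, the ratio parameter and the sample frequencies are assumptions made up to mirror the 100/50 example.

```python
def likely_translations(counts, ratio=0.5):
    """Keep the words whose frequency is at least ratio times the maximum frequency."""
    if not counts:
        return []
    threshold = max(counts.values()) * ratio
    # Sort by decreasing frequency, then alphabetically to keep ties stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, freq in ranked if freq >= threshold]


# Invented frequencies: the top word appears 100 times, so the cut-off is 50.
sample = {"vendetta": 100, "punizione": 62, "dei": 30, "giusta": 12}
print(likely_translations(sample))  # ['vendetta', 'punizione']
```

A word that sits exactly on the threshold is kept, matching the "equal to or higher" rule. Lowering ratio admits more candidates at the cost of more noise.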
Naturally, further refinements could be devised and implemented to improve accuracy, and a subsequent revision remains essential. However, the crude statistical comparison described here is already capable of providing meaningful results.

Incidentally, the PHP script in which I implemented the above steps gives vendetta and punizione (that is, revenge and punishment) as the most likely translations of ultio, an answer fully confirmed by the Castiglioni-Mariotti.

1. IL - Vocabolario della lingua latina by Luigi Castiglioni and Scevola Mariotti.
