UTX (simple glossary format)

Asia-Pacific Association for Machine Translation

What is UTX?

Introduction

UTX (Universal Terminology eXchange) is a simple, tab-delimited terminology format established by AAMT (Asia-Pacific Association for Machine Translation). AAMT comprises three groups: researchers, manufacturers, and users of machine translation systems. Machine translation is the core technology of translation software. AAMT members are volunteers.

In 1995, with support from IPA (Information-technology Promotion Agency, an institute in Japan), AAMT created UPF, the first specification to standardize user dictionary formats for MT systems. In 2006, AAMT started to create new specifications to reflect subsequent advances in technology and the changing usage of MT. In 2009, AAMT established the first UTX specification, which has since been revised and updated. Based on this specification, anyone can create, publish, and share a UTX glossary (also called a "UTX dictionary"). The benefits of UTX's simplicity are not limited to MT: UTX also incorporates terminology management features that are valuable for human translators.

Why UTX?

With UTX, a user can easily create, share, and reuse glossaries to improve translation quality. Have you ever felt that translation software produces only strange translations? When translation software fails to translate correctly, the problem is often that it lacks translation knowledge of the specific words and phrases in the text. You can greatly improve the accuracy of translation software (machine translation) by accumulating translation knowledge as a UTX glossary, and then converting it into a user dictionary for the translation software.

Preparing effective user dictionaries requires a huge effort from an individual user of translation software. Also, even an Excel glossary or a simple plain text file is difficult to share or reuse if the entry format is not standardized. Many glossaries are available on the Internet, but their formats are not readily usable out of the box. Time-consuming corrections and fine-tuning are required to use them in actual tools. However, if you use a standard format such as UTX, you can share a glossary among various tools and quickly reuse it.

Who creates and uses UTX?

UTX is specifically designed to be created and used by translators and end users of translation software. Creating or using it does not require advanced technical knowledge of linguistics, grammar, or machine translation software. An entry can be made from minimal data, such as a basic part of speech (noun, verb, etc.) and, if the entry is a noun, its plural form.


In which domains?

UTX can be used in any specialized domain that has technical terms, such as ICT, medicine, law, and engineering.

What kind of words should we include?

A UTX glossary contains only technical terms of specific domains, such as names of products, parts, diseases, medicines, and laws. It also contains proper nouns, such as names of people, places, and facilities. In many cases, entries are nouns, especially compound nouns. For example, a term like "XML declaration" can be correctly translated into its Japanese equivalent, "XML 宣言", simply by registering it in a user dictionary. Basic vocabulary like "window" should not be included, because such words are already contained in the system dictionaries of translation software. Translation accuracy can be improved by collecting, sharing, and reusing fine-tuned bilingual term data that translation software does not include out of the box.

Sentences should not be included, except when it is appropriate to treat them as "words." As a rule, UTX should be kept separate from translation memory, which is a bilingual database of sentences rather than words.

Multilingual glossary and term management

Since the character encoding of UTX is Unicode, it can handle almost any language. Normally, a UTX glossary includes only entries in a single source language A and their translations in a single target language B. Starting with UTX 1.20, you can specify multiple target languages. With UTX, you can manage terminological quality and ensure that the correct terms are used. You can assign one of four statuses to each entry: provisional, approved, forbidden, or non-standard. When multiple users contribute new terms, the initial term status is "provisional" (or left blank). The term administrator then checks each term and, if it is suitable, changes its status to "approved." A term with "approved" status can also be used for translation in the reverse direction (from language B to A). A "forbidden" status forbids the use of a specific term. A "non-standard" status means that even though the word is not the best translation, it needs to be included for processing purposes (an example is an alternative spelling).
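The status workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (`src`, `tgt`, `src:pos`, `term status`), the `#`-prefixed header line, and the sample rows are made up for this example and are not taken from the UTX specification, which should be consulted for the actual layout. A blank status is treated as "provisional," as described above.

```python
# Hypothetical glossary text. The header layout, column names, and sample
# rows are assumptions for illustration, not the official UTX layout.
SAMPLE = (
    "#src\ttgt\tsrc:pos\tterm status\n"
    "XML declaration\tXML 宣言\tnoun\tapproved\n"
    "style sheet\tスタイルシート\tnoun\t\n"
    "stylesheet\tスタイルシート\tnoun\tnon-standard\n"
    "old term\t旧用語\tnoun\tforbidden\n"
)

STATUSES = {"provisional", "approved", "forbidden", "non-standard"}


def read_entries(text):
    """Parse tab-delimited glossary text into a list of dicts."""
    lines = text.splitlines()
    # The first line is assumed to be the column header, prefixed with "#".
    header = lines[0].lstrip("#").split("\t")
    entries = []
    for line in lines[1:]:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cells = line.split("\t")
        cells += [""] * (len(header) - len(cells))  # pad short rows
        entry = dict(zip(header, cells))
        # A blank status counts as "provisional".
        status = entry.get("term status", "").strip() or "provisional"
        if status not in STATUSES:
            raise ValueError(f"unknown term status: {status!r}")
        entry["term status"] = status
        entries.append(entry)
    return entries


def by_status(entries, status):
    """Return only the entries that carry the given term status."""
    return [e for e in entries if e["term status"] == status]


entries = read_entries(SAMPLE)
for entry in by_status(entries, "provisional"):
    print("needs review:", entry["src"], "->", entry["tgt"])
```

A term administrator could use a filter like `by_status(entries, "provisional")` to list the terms that still await review, and `by_status(entries, "forbidden")` to build a list of terms to flag in translated text.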

How do we make UTX?

A UTX glossary can be easily created, edited, and viewed with any spreadsheet application or text editor. Some tools are available for converting between UTX and various other formats.

Tips for making a UTX dictionary (glossary)

Please also refer to the Quick Guide and the UTX specifications for details.

In what scenarios do we use UTX?

  1. Creating a glossary from scratch.
  2. Collecting translated terms during translation.
  3. Using it as an intermediate format when converting between various terminology formats.

How do we use UTX?

Since a UTX glossary uses a simple format, it can be easily converted and imported into various tools. In tools such as OmegaT (a translation memory tool) and ApSIC Xbench (a terminology reference tool), it can be used with very few changes.
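As a rough illustration of how little conversion is needed, the following Python sketch reduces a tab-delimited glossary to a plain two-column source/target list, which is a shape many tools can import. The function name `utx_to_two_column`, the column names `src` and `tgt`, the `#`-prefixed header lines, and the sample text are all assumptions made for this example; check the UTX specification and the target tool's documentation for the actual layouts.

```python
# Hypothetical input. The header layout and column names are assumptions
# for illustration, not the official UTX layout.
SAMPLE = (
    "#sample glossary; en/ja\n"
    "#src\ttgt\tsrc:pos\n"
    "XML declaration\tXML 宣言\tnoun\n"
    "\n"
    "style sheet\tスタイルシート\tnoun\n"
)


def utx_to_two_column(text, src_col="src", tgt_col="tgt"):
    """Return 'source<TAB>target' lines, dropping headers and extra columns."""
    header = None
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            fields = line[1:].split("\t")
            # Treat the "#" line that names both columns as the column header;
            # any other "#" line is ignored as metadata or a comment.
            if src_col in fields and tgt_col in fields:
                header = fields
            continue
        if header is None:
            raise ValueError("no column header found before the first entry")
        cells = line.split("\t")
        cells += [""] * (len(header) - len(cells))  # pad short rows
        row = dict(zip(header, cells))
        if row[src_col] and row[tgt_col]:
            out.append(f"{row[src_col]}\t{row[tgt_col]}\n")
    return "".join(out)


print(utx_to_two_column(SAMPLE), end="")
```

Looking columns up by name rather than by position means the sketch keeps working if additional columns are present or the column order differs.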

What does it cost to create a UTX glossary?

You can download and use the UTX specifications for free.

More answers can be found in the FAQ.