Copyright 1989, 1991 by the Logical Language Group, Inc. 2904 Beau Lane, Fairfax VA 22031-1303 USA. Phone (703) 385-0273; lojbab@lojban.org. All rights reserved. Permission to copy granted subject to your verification that this is the latest version of this document, that your distribution be for the promotion of Lojban, that there is no charge for the product, and that this copyright notice is included intact in the copy.

from ju'i lobypli #8 - March 1989

Machine Translation and Lojban
by Patrick Juola

There is an apocryphal story about an early (1952) automatic English/Russian translation project. A visiting senator asked to see the machine work, and gave it the phrase "Out of sight, out of mind" to translate. The senator, however, knew no Russian, so the (Russian) phrase was sent back through for the senator to read. The new version read "Invisible idiot."

Historically, there have been three main methods of machine translation. The simplest, earliest, and least functional can be called direct translation. These machines operate like a first-year language student: looking up a particular word or phrase in a dictionary, writing down the corresponding foreign word (or phrase), then performing simple transformations (e.g., moving the verb to the end in German) to make the result more grammatical. Like a first-year student, this requires almost no "understanding" of the language in question, and (like a first-year student) the translations produced tend to be very bad. The Georgetown Automatic Translator was an early machine of this type -- started in 1952, it became operational in 1964 and was actually in use until 1979 (and its "daughter," SYSTRAN, is still commercially successful). With the improvement in machine capacities (and over thirty years to correct mistakes), SYSTRAN does produce acceptable translations. At the very least, direct translation provides a yardstick against which to measure more sophisticated projects.
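The direct method can be sketched in a few lines of code. This is only a toy, in the spirit of (but far simpler than) systems like the Georgetown Automatic Translator: the tiny English-to-German lexicon and the single verb-movement rule are illustrative assumptions, not a real grammar.

```python
# Toy "direct translation": word-for-word dictionary lookup plus one
# simple structural transformation (moving the infinitive to the end,
# as German modal constructions require).

LEXICON = {
    "i": "ich",
    "can": "kann",
    "see": "sehen",
    "the": "den",
    "dog": "Hund",
}

def direct_translate(sentence):
    words = sentence.lower().split()
    # Step 1: look each word up in the bilingual dictionary
    # (unknown words pass through unchanged).
    out = [LEXICON.get(w, w) for w in words]
    # Step 2: a crude reordering rule -- after the modal "kann",
    # move the following verb to the end of the sentence.
    if "kann" in out:
        i = out.index("kann")
        if i + 1 < len(out):
            verb = out.pop(i + 1)
            out.append(verb)
    return " ".join(out)

print(direct_translate("I can see the dog"))  # ich kann den Hund sehen
```

Everything the program "knows" is in the lookup table and the one hard-coded rule, which is exactly why such systems need decades of patching to produce acceptable output.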
The majority of the modern MT projects [1] use what is called the transfer approach. Translation is divided into three phases: analysis, transfer, and synthesis. In analysis, the sentence, phrase, or paragraph to be translated is parsed into what is called a "parse tree," a detailed description of its syntax. At the same time, the meaning of the sentence is analyzed into a semantic network, describing the meanings of the individual words and the relationships among them. This is obviously a much harder project than merely looking up words in the dictionary, but it is frequently necessary for an accurate translation. For example: "The city council denied a permit to the women because they advocated violence." Who advocated violence, the council or the women? (In French, the two readings would be translated differently, because of the genders.) The transfer phase involves changing the parse tree in ways specific to the source and target languages (the actual "translation," usually requiring structural changes as well as dictionary look-up), while synthesis is reverse parsing.

The transfer method tends to produce the most accurate translations. Because of its firm theoretical basis in linguistics, it tends to be robust, amenable to analysis, and easy to update or modify. On the other hand, it is expensive. The amount of semantic information required per word is usually extensive, and the transfer functions tend to be complex and expensive to write. The TAUM project [2], for example, reported costs of $35-40 (Canadian) to develop each lexical entry (word, to non-linguists), as well as a cost of 16 cents/word for the actual translation and post-editing. (Comparable costs for human translation were about 12 cents/word.) Two sets of transfer rules must be written for each pair of languages. For the seven-language EEC, this would mean 42 different transfer functions. With the addition of Spain and Portugal, nine languages and 72 functions. And so on...
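The three-phase pipeline can be sketched as follows. Here the "parse tree" is just a nested tuple, and a single English-to-French transfer rule (this adjective moves after the noun) stands in for the large rule sets a real system would need; the vocabulary and rules are illustrative assumptions.

```python
# Minimal sketch of the transfer architecture:
#   analysis -> parse tree -> transfer -> target tree -> synthesis.

LEX = {"the": "le", "red": "rouge", "ball": "ballon", "rolls": "roule"}

def analyze(sentence):
    """Analysis: parse a toy Det-Adj-Noun-Verb sentence into a tree."""
    det, adj, noun, verb = sentence.lower().split()
    return ("S", ("NP", det, adj, noun), ("VP", verb))

def transfer(tree):
    """Transfer: dictionary look-up plus a structural change --
    French places this adjective after the noun."""
    _, (_, det, adj, noun), (_, verb) = tree
    return ("S", ("NP", LEX[det], LEX[noun], LEX[adj]), ("VP", LEX[verb]))

def synthesize(tree):
    """Synthesis: 'reverse parsing' -- flatten the tree back to text."""
    _, (_, *np), (_, verb) = tree
    return " ".join(np + [verb])

print(synthesize(transfer(analyze("The red ball rolls"))))
# le ballon rouge roule
```

Note that the expensive part of a real system is precisely what is hard-coded here: the per-word lexical entries and the language-pair-specific transfer rules.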
The final approach (and the one where Lojban would be most useful) is termed interlingua. Rather than writing specific transfer rules, one can instead translate into "linguistic universals," where the meaning and structure of the internal representation are independent of the language from which they were derived. This method is the closest to true "machine understanding" of language. Translation is then a two-step process: translating into the interlingua, then translating into the target language. This approach was pursued by CETA [3] for ten years, from 1961 to 1971. It is obvious that this method is easily expandable to multiple languages, since the interlingual representation can serve as a source text for any number of translations. (Only 14 functions are needed for the EEC, for instance, and 18 when Spain and Portugal are included.)

There are several disadvantages, however. The main problem is the design of the interlingua. Ideally, it should be capable of representing any human thought, with all its connotations and associations. The vocabulary required would stagger almost any programming team. For example, the verb "to wear" has four different translations in Japanese, depending upon what one wears. A programmer would not only need to be aware of this fact, but also be able to express what distinguishes the translations, in a form that the computer can understand. This problem is exacerbated by the fact that the computer does not "translate" using an interlingua; instead, it "retells" the source text in the target language. In poorly designed systems, this can lose important syntactic details (such as the passive voice), but even in good systems, style and connotations can be subtly (or unsubtly) affected.

[1] See (Goshawke 1987) for a listing of some modern projects. (Slocum 1987) has more detailed (and technical) descriptions.
[2] Traduction Automatique de l'Université de Montréal, 1965-1981 (see Gervais 1980).
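The function-count arithmetic behind these figures is simple: a transfer system needs a rule set for each ordered pair of languages, n*(n-1) in all, while an interlingua needs only one analyzer and one generator per language, 2*n.

```python
# Transfer vs. interlingua: how many translation functions must be
# written for n languages?

def transfer_functions(n):
    # One rule set per ordered pair of languages.
    return n * (n - 1)

def interlingua_functions(n):
    # One analyzer (into the interlingua) and one generator
    # (out of it) per language.
    return 2 * n

for n in (7, 9):
    print(f"{n} languages: {transfer_functions(n)} transfer functions, "
          f"{interlingua_functions(n)} interlingua functions")
# 7 languages: 42 transfer functions, 14 interlingua functions
# 9 languages: 72 transfer functions, 18 interlingua functions
```

The quadratic growth of the transfer approach is what makes the interlingua attractive for a multilingual body like the EEC, despite its other difficulties.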
(In contrast, a transfer approach can use cognates and similar items and avoid this problem.) Since the computer must be able to generate an interlingual representation, if for any reason the parser fails (ungrammatical sentences, specialized jargon or acronyms, typographical errors, and simple programmer or hardware errors can all cause parser failure), there can be no translation, whereas a transfer or direct approach can at least generate a word-by-word or phrase-by-phrase translation. Finally, it is obvious that the computer must perform two translations, with corresponding intelligibility loss in both.

Upon examination, Lojban appears to be an ideal or near-ideal interlingua. It is independently motivated to be an artificial language "capable of representing any human thought," although possibly with associations and connotations of its own. The actual vocabulary of Lojban is small, with the great majority of the "words" being metaphors (comet translates to bisli ke cmalu plini, "ice-small-planet." [4] In this case, one "word" is a four-word utterance.) This makes the development, addition, and representation of words much easier, particularly if lexical entries contain property lists (i.e. +/-Human, +/-Animate, +/-Female, etc.) as they do in most modern translation projects. In Lojban, a tanru is defined simply and completely on the basis of its properties ("planet, +Ice, +Small"). Since Lojban has been developed from the world's major languages, cognates may be available for both translations, improving their accuracy. It still has both the required degree of independence (from existing languages) and a certain amount of freedom from connotations. Finally, the task of generating target text from Lojban should be marginally simplified by Lojban's regular structure and large available vocabulary (of tanru).
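A lexical entry built from property lists, and a tanru composed from them, might look like the following sketch. The feature names (+Ice, +Small, etc.) follow the text; the data-structure choices are illustrative assumptions.

```python
# Sketch of lexical entries as property lists: a tanru takes the head
# gismu's features and adds one property per modifier.

GISMU = {
    "plini": {"gloss": "planet", "Animate": False, "Human": False},
    "bisli": {"gloss": "ice", "adds": "Ice"},
    "cmalu": {"gloss": "small", "adds": "Small"},
}

def tanru(*parts):
    """Build an entry for a tanru; the final gismu is the head."""
    entry = dict(GISMU[parts[-1]])       # start from the head's features
    for mod in parts[:-1]:               # each modifier adds a property
        entry[GISMU[mod]["adds"]] = True
    return entry

comet = tanru("bisli", "cmalu", "plini")
print(comet)
# {'gloss': 'planet', 'Animate': False, 'Human': False,
#  'Ice': True, 'Small': True}
```

Defining "comet" thus requires no new primitive vocabulary at all: the entry is fully determined by the head gismu "plini" and the properties its modifiers contribute.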
Since vocabulary and language design problems have been the major stumbling blocks, Lojban could already prove a major asset to an interlingua MT project. The true power of Lojban as an interlingua cannot yet be realized, or even assessed. Computers cannot yet "think" in first-order logic. [5] When this barrier is breached (and at this point, we have left the realm of engineering and entered science fiction), Lojban itself may be a computer programming language, making a Lojban-based translator self-programming. The implications of this are staggering, since at this point the "translation program" would be capable of understanding and learning from anything it read. This could be a breakthrough into true Artificial Intelligence.

At the very least, the translator would approximate human abilities in learning languages. When faced with a new word, the computer could ask for a definition (in the source language or in Lojban), then use its knowledge of the language structure to determine how that word should be used. For example, if it received the English word mallet, defined as "a small wooden hammer," it would know (since it knows about hammers) that a mallet is -Human, -Animate, and +Noun. It could then use the phrase "small wooden hammer" in the target language, and if it ever receives the phrase "small wooden hammer" in another text, it can translate it into English as mallet. Even without the major breakthrough, the fact that Lojban is so structured and unambiguous simplifies vocabulary development.

[3] Centre d'Etudes pour la Traduction Automatique, Grenoble, France.
[4] Metaphor by Jamie Bechtel and Bob LeChevalier (JL 7, p. 29).
[5] PROLOG, the best known logic programming language, can only express the "Horn clause subset" of first-order logic. It is also neither sound nor complete. This is typical of logical languages where machine performance is a consideration. (Lloyd 1984) has some highly technical discussion of the limitations of logic programming.
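The vocabulary-learning step described above (the mallet example) can be sketched as feature inheritance: a new word defined by a phrase inherits the features of the phrase's head word, and the paraphrase is remembered for later reuse. The feature set and lexicon layout are illustrative assumptions.

```python
# Sketch of learning a new word from its definition: inherit the
# defining head word's property list and store the paraphrase.

LEXICON = {
    "hammer": {"Human": False, "Animate": False, "Noun": True},
}

def learn(word, definition, head):
    """Define `word` by a phrase whose head word is already known."""
    entry = dict(LEXICON[head])         # inherit -Human, -Animate, +Noun
    entry["paraphrase"] = definition    # usable in the target language,
    LEXICON[word] = entry               # and recognizable in later texts
    return entry

learn("mallet", "small wooden hammer", head="hammer")
print(LEXICON["mallet"]["Noun"], "|", LEXICON["mallet"]["paraphrase"])
# True | small wooden hammer
```

Seeing "small wooden hammer" in a later source text, the system could then run the stored paraphrase in reverse and emit "mallet."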
The act of defining a word as a tanru automatically assigns it a place in a semantic network, and this procedure could to a large extent be automated using existing AI techniques. Similarly, the act of translation into an unambiguous language automatically focuses attention on the ambiguities present in the source text, where expert systems or human intelligence can be brought to bear to resolve them immediately.

In anything resembling artificial intelligence, there is always a conflict between those who want to "do it right" and those who want to "just do it." Since hardware is both inexpensive and fast, the performance of even bad designs can be quite good. In the long run, however, the problem of language understanding and translation is clearly fundamental to the development of true machine intelligence. Of the three current approaches to MT, the interlingua is clearly the most ambitious, but it may offer the best long-term potential for the understanding of human and machine linguistics. Lojban (or a similar unambiguous language) cannot by itself solve the problems of automatic translation (or AI), but it indicates an approach that may put the entire problem of machine intelligence on a much firmer footing. If human discourse can be expressed unambiguously and in a fashion that computers can use in their own "reasoning," machine intelligence is easily achievable.

It is interesting to speculate on one apparent limitation of Lojban-based translation. Humor based on ambiguity would be very difficult, if not impossible, to translate, since the ambiguity would have been (deliberately) lost in the interlingua. Is this a real limitation of the translator? How important is ambiguity to human language understanding? This question relates closely to the Sapir-Whorf hypothesis, if we assume that native speakers of Lojban would have the same language background as a Lojban-based AI. Will native Lojban speakers be able to understand ambiguity at all?
These questions must unfortunately remain open for quite some time (barring theoretical breakthroughs, until we have native Lojban speakers), but they are important not only to Lojban, but to the fundamental theory of "linguistic universals" and machine translation.

Bibliography

Gervais, A. (1980). Evaluation of the TAUM-AVIATION Machine Translation Pilot System. Translation Bureau, Secretary of State, Ottawa, Canada.
Goshawke, Walter, et al. (1987). Computer Translation of Natural Language. Halstead Press, NY.
Lloyd, J.W. (1984). Foundations of Logic Programming. Springer-Verlag, Berlin.
Slocum, Jonathan, ed. (1987). Machine Translation Systems. Cambridge University Press, Cambridge.
Wilks, Yorick. (1985). Machine Translation and Artificial Intelligence: Issues and their Histories. Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico.