Copyright 1989, 1991 by the Logical Language Group, Inc. 2904 Beau Lane, Fairfax VA 22031-1303 USA. Phone (703) 385-0273; lojbab@lojban.org. All rights reserved. Permission to copy granted subject to your verification that this is the latest version of this document, that your distribution be for the promotion of Lojban, that there is no charge for the product, and that this copyright notice is included intact in the copy.

from ju'i lobypli #8 - March 1989

Machine Translation and Lojban
by Patrick Juola

There is an apocryphal story about an early (1952) automatic English/Russian translation project. A visiting senator asked to see the machine work, and gave it the phrase "Out of sight, out of mind" to translate. The senator, however, knew no Russian, so the (Russian) phrase was sent back through for the senator to read. The new version read "Invisible idiot."

Historically, there have been three main methods of machine translation. The simplest, earliest, and least functional can be called direct translation. These machines operate like a first-year language student: looking up a particular word or phrase in a dictionary, writing down the corresponding foreign word (or phrase), then performing simple transformations (e.g., moving the verb to the end in German) to make the result more grammatical. Like a first-year student, this requires almost no "understanding" of the language in question, and (like a first-year student) the translations produced tend to be very bad. The Georgetown Automatic Translator was an early machine of this type -- started in 1952, it became operational in 1964 and was actually in use until 1979 (and its "daughter," SYSTRAN, is still commercially successful). With the improvement in machine capacities (and over thirty years to correct mistakes), SYSTRAN does produce acceptable translations. At the very least, direct translation provides a yardstick against which to measure more sophisticated projects.
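The direct method can be sketched in a few lines of code. This is only a toy, in the spirit of (but far simpler than) systems like the Georgetown Automatic Translator: the tiny English-to-German lexicon and the single verb-movement rule are illustrative assumptions, not a real grammar.

```python
# Toy "direct translation": word-for-word dictionary lookup plus one
# simple structural transformation (moving the infinitive to the end,
# as German modal constructions require).

LEXICON = {
    "i": "ich",
    "can": "kann",
    "see": "sehen",
    "the": "den",
    "dog": "Hund",
}

def direct_translate(sentence):
    words = sentence.lower().split()
    # Step 1: look each word up in the bilingual dictionary
    # (unknown words pass through unchanged).
    out = [LEXICON.get(w, w) for w in words]
    # Step 2: a crude reordering rule -- after the modal "kann",
    # move the following verb to the end of the sentence.
    if "kann" in out:
        i = out.index("kann")
        if i + 1 < len(out):
            verb = out.pop(i + 1)
            out.append(verb)
    return " ".join(out)

print(direct_translate("I can see the dog"))  # ich kann den Hund sehen
```

Everything the program "knows" is in the lookup table and the one hard-coded rule, which is exactly why such systems need decades of patching to produce acceptable output.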
The majority of the modern MT projects [1] use what is called the transfer approach. Translation is divided into three phases: analysis, transfer, and synthesis. In analysis, the sentence, phrase, or paragraph to be translated is parsed into what is called a "parse tree," a detailed description of its syntax. At the same time, the meaning of the sentence is analyzed into a semantic network, describing the meanings of the individual words and the relationships among them. This is obviously a much harder project than merely looking up words in the dictionary, but it is frequently necessary for an accurate translation. For example: "The city council denied a permit to the women because they advocated violence." Who advocated violence, the council or the women? (In French, the two readings would be translated differently, because of the genders.) The transfer phase involves changing the parse tree in ways specific to the source and target languages (the actual "translation," usually requiring structural changes as well as dictionary look-up), while synthesis is reverse parsing.

The transfer method tends to produce the most accurate translations. Because of its firm theoretical basis in linguistics, it tends to be robust, amenable to analysis, and easy to update or modify. On the other hand, it is expensive. The amount of semantic information required per word is usually extensive, and the transfer functions tend to be complex and expensive to write. The TAUM project [2], for example, reported costs of $35-40 (Canadian) to develop each lexical entry (word, to non-linguists), as well as a cost of 16 cents/word for the actual translation and post-editing. (Comparable costs for human translation were about 12 cents/word.) Two sets of transfer rules must be written for each pair of languages. For the seven-language EEC, this would mean 42 different transfer functions. With the addition of Spain and Portugal, nine languages and 72 functions. And so on...
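The three-phase pipeline can be sketched as follows. Here the "parse tree" is just a nested tuple, and a single English-to-French transfer rule (this adjective moves after the noun) stands in for the large rule sets a real system would need; the vocabulary and rules are illustrative assumptions.

```python
# Minimal sketch of the transfer architecture:
#   analysis -> parse tree -> transfer -> target tree -> synthesis.

LEX = {"the": "le", "red": "rouge", "ball": "ballon", "rolls": "roule"}

def analyze(sentence):
    """Analysis: parse a toy Det-Adj-Noun-Verb sentence into a tree."""
    det, adj, noun, verb = sentence.lower().split()
    return ("S", ("NP", det, adj, noun), ("VP", verb))

def transfer(tree):
    """Transfer: dictionary look-up plus a structural change --
    French places this adjective after the noun."""
    _, (_, det, adj, noun), (_, verb) = tree
    return ("S", ("NP", LEX[det], LEX[noun], LEX[adj]), ("VP", LEX[verb]))

def synthesize(tree):
    """Synthesis: 'reverse parsing' -- flatten the tree back to text."""
    _, (_, *np), (_, verb) = tree
    return " ".join(np + [verb])

print(synthesize(transfer(analyze("The red ball rolls"))))
# le ballon rouge roule
```

Note that the expensive part of a real system is precisely what is hard-coded here: the per-word lexical entries and the language-pair-specific transfer rules.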
The final approach (and the one where Lojban would be most useful) is termed interlingua. Rather than writing specific transfer rules, one can instead translate into "linguistic universals," where the meaning and structure of the internal representation are independent of the language from which they were derived. This method is the closest to true "machine understanding" of language. Translation is then a two-step process: translating into the interlingua, then translating into the target language. This approach was pursued by CETA [3] for ten years, from 1961 to 1971. It is obvious that this method is easily expandable to multiple languages, since the interlingual representation can serve as a source text for any number of translations. (Only 14 functions are needed for the EEC, for instance, and 18 when Spain and Portugal are included.)

There are several disadvantages, however. The main problem is the design of the interlingua. Ideally, it should be capable of representing any human thought, with all its connotations and associations. The vocabulary required would stagger almost any programming team. For example, the verb "to wear" has four different translations in Japanese, depending upon what one wears. A programmer would not only need to be aware of this fact, but also be able to express what distinguishes the translations, in a form that the computer can understand. This problem is exacerbated by the fact that the computer does not "translate" using an interlingua; instead, it "retells" the source text in the target language. In poorly designed systems, this can lose important syntactic details (such as the passive voice), but even in good systems, style and connotations can be subtly (or unsubtly) affected.

[1] See (Goshawke 1987) for a listing of some modern projects. (Slocum 1987) has more detailed (and technical) descriptions.
[2] Traduction Automatique de l'Université de Montréal, 1965-1981 (see Gervais 1980).
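The function-count arithmetic behind these figures is simple: a transfer system needs a rule set for each ordered pair of languages, n*(n-1) in all, while an interlingua needs only one analyzer and one generator per language, 2*n.

```python
# Transfer vs. interlingua: how many translation functions must be
# written for n languages?

def transfer_functions(n):
    # One rule set per ordered pair of languages.
    return n * (n - 1)

def interlingua_functions(n):
    # One analyzer (into the interlingua) and one generator
    # (out of it) per language.
    return 2 * n

for n in (7, 9):
    print(f"{n} languages: {transfer_functions(n)} transfer functions, "
          f"{interlingua_functions(n)} interlingua functions")
# 7 languages: 42 transfer functions, 14 interlingua functions
# 9 languages: 72 transfer functions, 18 interlingua functions
```

The quadratic growth of the transfer approach is what makes the interlingua attractive for a multilingual body like the EEC, despite its other difficulties.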
(In contrast, a transfer approach can use cognates and similar items and avoid this problem.) Since the computer must be able to generate an interlingual representation, if for any reason the parser fails (ungrammatical sentences, specialized jargon or acronyms, typographical errors, and simple programmer or hardware errors can all cause parser failure), there can be no translation, whereas a transfer or direct approach can at least generate a word-by-word or phrase-by-phrase translation. Finally, it is obvious that the computer must perform two translations, with corresponding intelligibility loss in both.

Upon examination, Lojban appears to be an ideal or near-ideal interlingua. It is independently motivated to be an artificial language "capable of representing any human thought," although possibly with associations and connotations of its own. The actual vocabulary of Lojban is small, with the great majority of the "words" being metaphors (comet translates to bisli ke cmalu plini, "ice-small-planet." [4] In this case, one "word" is a four-word utterance.) This makes the development, addition, and representation of words much easier, particularly if lexical entries contain property lists (i.e. +/-Human, +/-Animate, +/-Female, etc.) as they do in most modern translation projects. In Lojban, a tanru is defined simply and completely on the basis of its properties ("planet, +Ice, +Small"). Since Lojban has been developed from the world's major languages, cognates may be available for both translations, improving their accuracy. It still has both the required degree of independence (from existing languages) and a certain amount of freedom from connotations. Finally, the task of generating target text from Lojban should be marginally simplified by Lojban's regular structure and large available vocabulary (of tanru).
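A lexical entry built from property lists, and a tanru composed from them, might look like the following sketch. The feature names (+Ice, +Small, etc.) follow the text; the data-structure choices are illustrative assumptions.

```python
# Sketch of lexical entries as property lists: a tanru takes the head
# gismu's features and adds one property per modifier.

GISMU = {
    "plini": {"gloss": "planet", "Animate": False, "Human": False},
    "bisli": {"gloss": "ice", "adds": "Ice"},
    "cmalu": {"gloss": "small", "adds": "Small"},
}

def tanru(*parts):
    """Build an entry for a tanru; the final gismu is the head."""
    entry = dict(GISMU[parts[-1]])       # start from the head's features
    for mod in parts[:-1]:               # each modifier adds a property
        entry[GISMU[mod]["adds"]] = True
    return entry

comet = tanru("bisli", "cmalu", "plini")
print(comet)
# {'gloss': 'planet', 'Animate': False, 'Human': False,
#  'Ice': True, 'Small': True}
```

Defining "comet" thus requires no new primitive vocabulary at all: the entry is fully determined by the head gismu "plini" and the properties its modifiers contribute.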
Since vocabulary and language design problems have been the major stumbling blocks, Lojban could already prove a major asset to an interlingua MT project. The true power of Lojban as an interlingua cannot yet be realized, or even assessed. Computers cannot yet "think" in first-order logic. [5] When this barrier is breached (and at this point, we have left the realm of engineering and entered science fiction), Lojban itself may be a computer programming language, making a Lojban-based translator self-programming. The implications of this are staggering, since at this point the "translation program" would be capable of understanding and learning from anything it read. This could be a breakthrough into true Artificial Intelligence.

At the very least, the translator would approximate human abilities in learning languages. When faced with a new word, the computer could ask for a definition (in the source language or in Lojban), then use its knowledge of the language structure to determine how that word should be used. For example, if it received the English word mallet, defined as "a small wooden hammer," it would know (since it knows about hammers) that a mallet is -Human, -Animate, and +Noun. It could then use the phrase "small wooden hammer" in the target language, and if it ever receives the phrase "small wooden hammer" in another text, it can translate it into English as mallet. Even without the major breakthrough, the fact that Lojban is so structured and unambiguous simplifies vocabulary development.

[3] Centre d'Etudes pour la Traduction Automatique, Grenoble, France.
[4] Metaphor by Jamie Bechtel and Bob LeChevalier (JL 7, p. 29).
[5] PROLOG, the best known logic programming language, can only express the "Horn clause subset" of first-order logic. It is also neither sound nor complete. This is typical of logical languages where machine performance is a consideration. (Lloyd 1984) has some highly technical discussion of the limitations of logic programming.
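The vocabulary-learning step described above (the mallet example) can be sketched as feature inheritance: a new word defined by a phrase inherits the features of the phrase's head word, and the paraphrase is remembered for later reuse. The feature set and lexicon layout are illustrative assumptions.

```python
# Sketch of learning a new word from its definition: inherit the
# defining head word's property list and store the paraphrase.

LEXICON = {
    "hammer": {"Human": False, "Animate": False, "Noun": True},
}

def learn(word, definition, head):
    """Define `word` by a phrase whose head word is already known."""
    entry = dict(LEXICON[head])         # inherit -Human, -Animate, +Noun
    entry["paraphrase"] = definition    # usable in the target language,
    LEXICON[word] = entry               # and recognizable in later texts
    return entry

learn("mallet", "small wooden hammer", head="hammer")
print(LEXICON["mallet"]["Noun"], "|", LEXICON["mallet"]["paraphrase"])
# True | small wooden hammer
```

Seeing "small wooden hammer" in a later source text, the system could then run the stored paraphrase in reverse and emit "mallet."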
The act of defining a word as a tanru automatically assigns it a place in a semantic network, and this procedure could to a large extent be automated using existing AI techniques. Similarly, the act of translation into an unambiguous language automatically focuses attention on the ambiguities present in the source text, where expert systems or human intelligence can be brought to bear to resolve them immediately.

In anything resembling artificial intelligence, there is always a conflict between those who want to "do it right" and those who want to "just do it." Since hardware is both inexpensive and fast, the performance of even bad designs can be quite good. In the long run, however, the problem of language understanding and translation is clearly fundamental to the development of true machine intelligence. Of the three current approaches to MT, the interlingua is clearly the most ambitious, but it may offer the best long-term potential for the understanding of human and machine linguistics. Lojban (or a similar unambiguous language) cannot by itself solve the problems of automatic translation (or AI), but it indicates an approach that may put the entire problem of machine intelligence on a much firmer footing. If human discourse can be expressed unambiguously and in a fashion that computers can use in their own "reasoning," machine intelligence is easily achievable.

It is interesting to speculate on one apparent limitation of Lojban-based translation. Humor based on ambiguity would be very difficult, if not impossible, to translate, since the ambiguity would have been (deliberately) lost in the interlingua. Is this a real limitation of the translator? How important is ambiguity to human language understanding? This question relates closely to the Sapir-Whorf hypothesis, if we assume that native speakers of Lojban would have the same language background as a Lojban-based AI. Will native Lojban speakers be able to understand ambiguity at all?
These questions must unfortunately remain open for quite some time (barring theoretical breakthroughs, until we have native Lojban speakers), but they are important not only to Lojban, but to the fundamental theory of "linguistic universals" and machine translation.

Bibliography

Gervais, A. (1980). Evaluation of the TAUM-AVIATION Machine Translation Pilot System. Translation Bureau, Secretary of State, Ottawa, Canada.
Goshawke, Walter, et al. (1987). Computer Translation of Natural Language. Halstead Press, NY.
Lloyd, J.W. (1984). Foundations of Logic Programming. Springer-Verlag, Berlin.
Slocum, Jonathan, ed. (1987). Machine Translation Systems. Cambridge University Press, Cambridge.
Wilks, Yorick. (1985). Machine Translation and Artificial Intelligence: Issues and their Histories. Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico.