LOGOS - Multilingual Translation Portal

40 - Prototext analysis and computer

"... an avalanche of isolated words which
expresses that truth I still do not know, and from
which the computer, reversing its program, could
construct the book, my book"¹.

  From what we have just said about Shklovsky and perceptive automatisms, it is evident that, as far as lexical markedness is concerned (and only in that respect), computer analysis of the text can be very helpful to the translator. Usually the translator will be inclined to translate a marked word with a marked word, and an ordinarily collocated word with an analogously standard one.
  Among the most frequent words in a text, by contrast, are the so-called "empty words". These are mostly conjunctions, prepositions, articles and copulas, which serve not poetic or conceptual expression but only syntactic coherence. Since they are always the most frequent forms, their high frequency is not in itself significant.
  Traditionally, empty words are ignored by researchers. Recently, however, some scholars have postulated that, precisely because of their semantic emptiness and easy replaceability (think, in Italian, of "tra" and "fra", "fino" and "sino", etc.), they could form, especially if taken as a whole, text strings: identical segments used unconsciously by the author, which could signal a possible ancestral similarity and thus constitute a sort of fingerprint, the genetic patrimony of an author's style.
  If such a hypothesis is particularly useful in reconstructing the authorship of ancient documents, where the point is to attribute a text to one author rather than another, it is also suggestive in translation-oriented analysis: one can record such traits in the original and then check whether they are also present in the metatext and, wherever they are not, decide whether that has consequences on the interpretive plane, for reception and intertextual influences.
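  The idea of a stylistic fingerprint built on interchangeable empty words can be illustrated with a minimal Python sketch. It only measures how often an author prefers the first variant of each pair; the function name pair_preferences and the simple letter-based tokenizer are assumptions made for illustration, not the method of any particular scholar or program.

```python
import re
from collections import Counter

# Interchangeable Italian pairs taken from the text above.
PAIRS = [("tra", "fra"), ("fino", "sino")]

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def pair_preferences(text, pairs=PAIRS):
    """For each pair, return the share of occurrences that use the first
    variant, or None when neither variant occurs in the text."""
    counts = Counter(tokenize(text))
    prefs = {}
    for first, second in pairs:
        total = counts[first] + counts[second]
        prefs[(first, second)] = counts[first] / total if total else None
    return prefs

sample = "Tra noi e fra voi, tra tutti, sino a domani."
print(pair_preferences(sample))
```

  Comparing such shares across texts of disputed authorship, or between different texts by the same author, is only meaningful when the texts are long enough for the pairs to occur many times.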
On the other hand, the presence of very frequent words that do not belong to the empty-word group can be an index of textual cohesion. Think, for example, of the names of characters, or of the use in their place (in the prototext or the metatext) of pronouns, and of the presence of deictics and their treatment in translation; in the latter case, a general tendency of translated texts toward greater explicitness has been noticed. Often the translator takes upon herself functions of cultural mediation that do not properly belong to the translation function and, understanding the "translator's mission" idiosyncratically, tends to facilitate the reading of the metatext, often unconsciously, by replacing deictics that refer to a contingent situation (this, here, now, the, one, etc.) with others that refer to a chronotopically distant or indefinite situation (that, there, then, a, the other, etc.).
In brief, in order to attribute a precise meaning to low or high frequencies it is always necessary to add a human verbal synthesis that contextualizes the numerical data.
  Then there are words whose high frequency is not statistically significant but indicates the presence of motifs and themes within a text: isotopic networks that have great cohesive significance even though they are not among the rare occurrences. It is therefore important to check for their presence in the metatext as well.
  There are thus no exact universal rules for interpreting raw machine data: even a researcher equipped with powerful tools for exact measurement must use scientific creativity to avoid being misled by supposedly objective truths.
  One of the most widespread programs for text analysis, used in universities all over the world, is called TACT; it is freely available on the Internet to all researchers and to anyone not using it for commercial purposes. This simple program works with plain-text (.txt) files with line breaks. It is not easy to use: learning to use it profitably requires some familiarity with computers as they were before the advent of user-friendly interfaces. It runs in DOS environments and, in a sense, ignores Windows.
  One of the data obtainable with TACT and similar programs is the type/token ratio, i.e. the ratio between word types and their occurrences. If all the words of a text are arranged in alphabetical order, one sees that some words are repeated while others appear only once. Imagine writing each repeated word on a single line: the number of lines gives the types (different words), while the total number of words in the text gives the occurrences, or tokens:
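  A frequency list restricted to words outside the empty-word group can be sketched in a few lines of Python. The EMPTY_WORDS set below is a deliberately short, illustrative assumption (real lists are much longer and language-specific), and content_word_frequencies is a hypothetical helper name.

```python
import re
from collections import Counter

# Illustrative, incomplete list of English empty words (an assumption).
EMPTY_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "if", "of", "to",
    "in", "on", "at", "is", "are", "was", "were", "than",
})

def content_word_frequencies(text, empty_words=EMPTY_WORDS, top=10):
    """Return the `top` most frequent words that are not empty words,
    as (word, count) pairs in descending order of frequency."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(t for t in tokens if t not in empty_words)
    return counts.most_common(top)

sample = "The sea and the ship. The ship of the sea, the ship."
print(content_word_frequencies(sample, top=2))
```

  The list only says which words recur; deciding whether a recurring word belongs to a motif or theme remains a matter for the human reader.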
  If a man is to do something more than human, he must have more than human powers.
  In this microtext there are 14 types and 17 tokens, because the words "human", "more" and "than" each occur twice:
  a
  do
  have
  he
  human human
  If
  is
  man
  more more
  must
  powers
  something
  than than
  to
A further datum derived from these counts is the type/token ratio, which for the microtext above is 14/17, or about 0.82. The higher the ratio, the lexically richer and more complex the text; conversely, the lower the ratio, the more words are repeated and the lower the lexical complexity.
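  The count of types and tokens can be reproduced with a minimal Python sketch. It assumes that words are compared without regard to case and that punctuation is ignored; the function name type_token_ratio is chosen for illustration and does not reproduce how TACT itself works.

```python
import re

def type_token_ratio(text):
    """Return (number of types, number of tokens, types / tokens).
    Tokens are runs of letters, compared case-insensitively."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    types = set(tokens)
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(types), len(tokens), ratio

sentence = ("If a man is to do something more than human, "
            "he must have more than human powers.")
n_types, n_tokens, ratio = type_token_ratio(sentence)
print(n_types, n_tokens, round(ratio, 2))  # 14 17 0.82
```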
  Another datum that can be obtained is the collocation of a word in its co-text, with the length of the co-text sample chosen by the user.
  In any case, all the data obtained must be related to the size of the reference text because, of course, the smaller the text, the less significant the corresponding statistics.
  Consulting corpora instead of dictionaries is not a quick or comfortable operation; on the contrary, it can require a good deal of time. The result, however, is more satisfactory, because it gives a more precise and concrete idea of the meaning of a word or sentence.
  For the English language, the largest corpus available to the public is the British National Corpus, at the URL:
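  A keyword-in-context listing of this kind can be sketched in Python as follows. The function name kwic and the width parameter (the number of words shown on each side) are assumptions made for illustration; actual concordance programs offer many more options.

```python
import re

def kwic(text, keyword, width=3):
    """Return a list of (left co-text, keyword, right co-text) tuples,
    with up to `width` words on each side of every occurrence."""
    tokens = re.findall(r"[^\W\d_]+", text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

sentence = ("If a man is to do something more than human, "
            "he must have more than human powers.")
for left, word, right in kwic(sentence, "human", width=2):
    print(f"{left:>12} [{word}] {right}")
```

  Widening or narrowing the co-text window changes how much of the surrounding phrase is visible for each occurrence.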
  http://firth.natcorp.ox.ac.uk/
  From here you can run queries knowing that you can count on a set of texts ranging from the classics of the English language to oral texts recorded for the radio.
  For the Italian language (though many more languages are represented, Latin included), one of the largest corpora is Wordtheque, at the address:
  http://www.logoslibrary.eu
  In both cases, each retrieved word is linked to the text from which it was extracted, with data on the author and the publication. This allows the person consulting the corpus to draw her own conclusions about the degree of reliability of the occurrence and the register of usage. To go more thoroughly into the subject matter of the last two units, refer to the books listed among the references. Those who have followed this course are welcome to proceed to the third part, devoted to the production of the metatext.

Bibliographical references

CALVINO I. If on a Winter's Night a Traveller, London, Einaudi, 1979.

LANA M. Testi stile frequenze, in Lingua letteratura computer, edited by Mario Ricciardi, Torino, Bollati Boringhieri, 1996, ISBN 8833909905.

LANA M. L'uso del computer nell'analisi dei testi. Scienze umane e nuove tecnologie, Milano, Franco Angeli, 1994, ISBN 8820488701.

LANCASHIRE I. Using TACT with Electronic Texts. A Guide to Text-Analysis Computing Tools, New York, The Modern Language Association of America, 1996, ISBN 0873525698.

ORLANDI T. Informatica umanistica, Roma, La Nuova Italia Scientifica, 1990, ISBN 8843008870.

SHANNON C. E., WEAVER W. The Mathematical Theory of Communication, Urbana (Illinois), University of Illinois Press, 1949.

SHKLOVSKY V. B. O teorii prozy, Moskva, Federacija, 1929. Teoria della prosa, translation by C. G. de Michelis and R. Oliva, Torino, Einaudi, 1976. Theory of prose, translated by Benjamin Sher with an introduction by Gerald R. Bruns. 1st American edition, Elmwood Park (Illinois), Dalkey Archive Press, 1990, ISBN 0916583546.

¹ Calvino 1979, p. 189.