LOGOS - Multilingual Translation Portal

39 - Prototext analysis and computer

"Perhaps instead of a book I could write lists of
words, in alphabetical order, an avalanche of
isolated words which expresses that truth I still do
not know..."¹.

  For many decades the application of computers to the philological sciences, and to text analysis in particular, was considered a sort of taboo. This is mainly because a philologist's education - be he translator, critic, professor or researcher - has historically never included the study of computer science applications to texts, nor, on the other hand, has the computer scientist's education included the application of electronic power to specific and markedly humanistic activities like philology and ecdotics².
  Nevertheless, I think it would be interesting to conclude this second part of the translation course, devoted to prototext analysis, with two units on the opportunities computers provide to the translator wishing or needing to analyze the prototext with a view to its translation.
  After the blunders of the Seventies and the Eighties - caused by excessive enthusiasm for swift technical evolution, which gave a first glimpse of the possibility of applying computers to everything, human sciences and even translation included - a more cautious stage of investigation began.
  What emerged was the importance, in decoding a text, of the relation between the context and the possible meanings of a word: the word's links to the co-text activate the useful valences and suppress those not applicable. (A computer, having no semantic competence, can only be used to count occurrences; it is not able to define and delineate context.)
  That, together with the new technical opportunities, gave new strength to the use of corpora for human science research.
  By "corpus" we mean a set of texts sharing some feature: the author, or language, or time, or genre, or other. The fundamental principle guiding this study is the observation of language - a descriptive principle - in order to extract norms intended not as prescriptions, but as a ruler by which to measure regularity. And, in order to be able to observe language in its living form, it is useful to refer to corpora as constant parameters to be compared with specific speech acts, to be queried, and upon which to do research.
  One of the foremost fruits of the study of corpora is the compilation of concordances, i.e. lists of words (and strings of words) that recur identically within a corpus, linked with the data necessary to retrieve them (their coordinates within the text). Concordances, compiled manually until just a few decades ago through time-consuming and tiring card indexing, entry by entry, can now be compiled automatically by any personal computer once the electronic text has been prepared and pre-edited. This revived³ the discipline known as corpus linguistics.
  In translation studies, the use of computerized and online corpora dates back to the last decade of the 20th century. To some extent, in translation practice reference to corpora is supplanting reference to dictionaries, enabling a more direct and concrete interpretation of the semantic spectrum of each word in its context.
  A dictionary provides an interpretation of a word's meanings, namely that of the dictionary's author, and the interpretive paths it offers the translator do not necessarily coincide with those that are useful in the given context. The consultation of a corpus is much more empirical and flexible, because it leaves the translator (or the decoder, anyway) the opportunity to intuit, to infer possible meanings from the very scene in which they arise: the speech act.
  In text analysis in general a computer can be a precious tool, provided one is aware of the limits of its scope. Let us start by reviewing the limits of electronic text analysis.
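As an illustration of what automatic compilation might look like, here is a minimal sketch in Python. The function names (`tokenize`, `build_concordance`, `keyword_in_context`), the sample sentence and the crude tokenization rule are all assumptions made for the example, not a description of any specific concordance program; the "coordinates" are simply token positions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"\w+", text.lower())

def build_concordance(text):
    """Map each word form to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokenize(text)):
        index[word].append(position)
    return dict(index)

def keyword_in_context(text, keyword, width=3):
    """Return every occurrence of `keyword` with `width` tokens of co-text on each side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((i, left, token, right))
    return lines

# Hypothetical sample text, used only to exercise the functions.
sample = "The sea was calm. The old man looked at the sea and the sea looked back."
concordance = build_concordance(sample)
print(concordance["sea"])
for entry in keyword_in_context(sample, "sea", width=2):
    print(entry)
```

A real corpus would need the preparation and pre-editing mentioned above (normalizing spelling, deciding how to treat punctuation and inflected forms) before such an index becomes reliable.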
  A computer is able to count the frequency of a word's occurrence, or of a string of characters, within a text, and perhaps to calculate the predictability of such occurrences by comparing the frequency within a microtext to the frequency within a macrotext containing it (the corpus, for example). It is, therefore, very useful for the lexical analysis of a text, and thus also for the analysis of its lexical cohesion and of its intertextual and intratextual references.
  A computer is also able to calculate the statistical predictability that a word co-occurs with (i.e. falls near) another, still within the framework of a given text, i.e. in relative, not absolute, terms.
  From what we have just said, it is evident that the person conducting the analysis plays a very active role from the earliest stages of the work. Given the huge quantity of data a text can produce through the use of computers, it is indispensable for the guidelines of the investigation to be clear from the beginning. Before subjecting a text to computer analysis, one needs a firm general knowledge of the author and of the given work in particular, as well as hypotheses that the electronic tool may help either to falsify or to validate.
  At the same time, one must also consider that the first results often direct research along channels initially unsuspected, while other paths are abandoned during the research owing to insufficient clues.
  A computer is therefore not able to set up the research, to decide which queries make sense and which do not; neither is it able to interpret data. Once in possession of a list of a text's occurrences and co-occurrences, and of the corresponding coordinates, the translator must be able to say which data are really meaningful. Methodological criteria in this sense are not absolute; they vary according to the precise circumstances of the individual analysis.
  Which words are most important: those that recur most often, or the rarest? In classical philology there is the notion of hapax legomenon, defined as "a form recurring only once in the given text or corpus"⁴. The term arose because a rare form is usually considered particularly precious and, of course, a hapax is all the more precious the vaster the corpus containing it. A hapax exists only in the presence of many other frequently used words.
  Shannon and Weaver talk about an inverse proportion between predictability and informativeness. Translated into simple terms, and applied to what most interests a translator engaged in translation-oriented text analysis, one can say that the greater the chance of finding a word in a given position in a text, the less significant its presence there. It is another way to look at the question of the markedness of a speech act that we have so often dealt with during this course. The reasoning is very similar to Shklovsky's on "defamiliarization" or "making strange", the breaking of perceptual automatism.
  In the 1920s Viktor Shklovsky, a Russian Formalist, realized the importance, in text perception, of breaking the routine, of examining an object from an unprecedented point of view⁵. An unmarked text passage, as we would call it today, i.e. a non-defamiliarizing passage, is perceived as if its form were secondary, and the reader is induced to tune in automatically to the informative, denotative contents, without caring about expressive modes. A marked text, on the contrary, i.e. a defamiliarizing text, places obstacles in the way of "normal", automatic perception and draws the reader's attention to the form, which in this way, precisely because it is patently obvious and alters the text's perception, becomes an integral part of its contents.
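The comparison between microtext and macrotext can be sketched as a ratio of relative frequencies. The code below is only one possible way to do it: the function names, the sample strings and the choice of a simple ratio (rather than a more refined statistical measure) are assumptions made for the example.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of all the tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

def frequency_ratio(word, microtext, macrotext):
    """How many times more frequent `word` is in the microtext than in the macrotext.

    Returns None when the word never occurs in the macrotext.
    """
    micro = relative_frequencies(microtext).get(word, 0.0)
    macro = relative_frequencies(macrotext).get(word, 0.0)
    if macro == 0.0:
        return None
    return micro / macro

# Hypothetical texts: the macrotext contains the microtext, as in the passage above.
microtext = "the fog the fog the river"
macrotext = microtext + " the road the house the garden the river"

print(frequency_ratio("fog", microtext, macrotext))  # well above 1: typical of the microtext
print(frequency_ratio("the", microtext, macrotext))  # about 1: equally common in both
```

A ratio well above 1 suggests that a word characterizes the microtext with respect to the corpus; deciding whether that is meaningful remains the analyst's job.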
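A rough sketch of such a co-occurrence count follows. It assumes that "near" means "within a fixed window of tokens", and it measures the share of a word's occurrences that have the other word nearby; the function name `cooccurrence_rate`, the window size and the sample text are illustrative assumptions.

```python
import re

def cooccurrence_rate(text, node, collocate, window=2):
    """Share of the occurrences of `node` that have `collocate` within `window` tokens.

    The result is relative to this text only; it returns None if `node` never occurs.
    """
    tokens = re.findall(r"\w+", text.lower())
    positions = [i for i, token in enumerate(tokens) if token == node]
    if not positions:
        return None
    hits = 0
    for i in positions:
        # Tokens to the left and to the right of this occurrence, excluding the node itself.
        neighbourhood = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if collocate in neighbourhood:
            hits += 1
    return hits / len(positions)

# Hypothetical sample text.
sample = "dark night, dark wood, dark night again, bright night"
print(cooccurrence_rate(sample, "night", "dark", window=1))
print(cooccurrence_rate(sample, "dark", "night", window=1))
```

Note that the measure is not symmetrical: in the sample, every "dark" has a "night" next to it, but not every "night" has a "dark".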
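Extracting the hapax legomena of a text is a simple matter of counting. The sketch below assumes that a "form" is a lowercased word token; the function name and the sample text are invented for the example.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return, in alphabetical order, the word forms that occur exactly once in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, n in counts.items() if n == 1)

# Hypothetical sample text.
sample = "the sea and the sky and the gull"
print(hapax_legomena(sample))
```

Whether two inflected forms of the same word count as one form or two is a decision to be taken before counting, and it changes the list.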
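In information theory this inverse relation is usually expressed as the information content of an event with probability p, namely -log2(p) bits: the more probable the event, the fewer bits it carries. The sketch below estimates p from plain word frequencies in the text, which is a strong simplifying assumption (it ignores position and co-text); the function name and the sample text are invented for the example.

```python
import math
import re
from collections import Counter

def surprisal_bits(text):
    """Estimate each word's information content as -log2(relative frequency)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: -math.log2(n / total) for word, n in counts.items()}

# Hypothetical sample text: "the" is frequent, "gull" occurs only once.
sample = "the sea the sky the sea the gull"
for word, bits in sorted(surprisal_bits(sample).items(), key=lambda item: item[1]):
    print(word, round(bits, 2))
```

In the sample the most frequent word receives the lowest value and the words occurring only once the highest, which matches the intuition about rare forms.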

Bibliographical references

CALVINO I. If on a Winter's Night a Traveller, London, Einaudi, 1979.

LANA M. Testi stile frequenze, in Lingua letteratura computer, edited by Mario Ricciardi, Torino, Bollati Boringhieri, 1996, ISBN 8833909905.

LANA M. L'uso del computer nell'analisi dei testi. Scienze umane e nuove tecnologie, Milano, Franco Angeli, 1994, ISBN 8820488701.

LANCASHIRE I. Using TACT with Electronic Texts. A Guide to Text-Analysis Computing Tools, New York, The Modern Language Association of America, 1996, ISBN 0873525698.

ORLANDI T. Informatica umanistica, Roma, La Nuova Italia Scientifica, 1990, ISBN 8843008870.

SHANNON C. E., WEAVER W. The Mathematical Theory of Communication, Urbana (Illinois), University of Illinois Press, 1949.

SHKLOVSKY V. B. O teorii prozy, Moskva, Federacija, 1929. Teoria della prosa, translation by C. G. de Michelis and R. Oliva, Torino, Einaudi, 1976. Theory of prose, translated by Benjamin Sher with an introduction by Gerald R. Bruns. 1st American edition, Elmwood Park (Illinois), Dalkey Archive Press, 1990, ISBN 0916583546.

¹ Calvino 1979, p. 189.
² Textual philology. From the Greek ekdosis, edition.
³ Corpus linguistics was born and initially flourished before the advent of Chomskian theories. As generative grammar spread, corpus linguistics came to a halt, since generative grammar implies a thoroughly opposite approach: it starts from the deep structure and works up to the speech act, while corpus linguistics starts from the speech act and works down to the contextual and ephemeral meaning. The diffusion of personal computers with a power previously unthinkable, together with criticism of Chomskian theory, gave new life to corpus linguistics.
⁴ Lana 1996.
⁵ Shklovsky 1929.