20 - Search engines - part two
"the different arrangement of the material in the dream makes the dream untranslatable, so to speak, for the waking consciousness"1.
In the previous unit I examined a few functions of the Google search engine. Above all, I expanded on the possibilities offered by Google's linguistic tools.
The arguments made in the previous unit rely on the frequency of use of given words or word combinations. This implies treating the Internet as a corpus. But you must remember that the Internet is a corpus created every minute, spontaneously, through the interaction of all those willing to participate. Consequently, it lacks some features of a corpus as understood in linguistic research: it is not balanced, as the British National Corpus is, and there is no deliberate sampling of oral speech, written language, high registers, low registers and so on. As a result, it is completely extemporaneous.
This doesn't mean that it cannot be a useful tool. High scientific registers are usually fairly well represented. Let's not forget that the international scientific community was among the first to become aware of the communication potential of the Internet and to use it to anticipate, and sometimes replace, paper publications, greatly speeding up the development of academic debate.
Middle and low registers are well represented here, too. The Internet is peculiar in that it is a "place" where you can publish without a filter. This gives the medium a potentially strong democratic value, because a filter can sometimes be censorious. On the other hand, it also means that any person who is able to read and write and has actual access, whatever the level and type of her education, can publish her own writings without the mediation of any editor.
This distinguishes the Internet from any other medium. For example, if a newspaper publishes readers' letters, the editorial board may intervene to correct the form, if not the contents, so you cannot say that what is published is exactly what arrived from the reader. Moreover, newspaper readers often have a higher-than-average level of education. And when someone writes to a newspaper hoping to see her letter published, she tries to write as well as she can.
On the Internet, by contrast, what one "says" is indeed put out for everybody, but it is also less conspicuous, hidden among billions of other bits of available information. As a result, whoever wants to express herself feels much freer to do so as she wishes, even using jargon or expressions typical of oral speech, because the written channel is often chosen out of need rather than by free choice.
For these reasons, both the lack of an editing filter (not necessarily a censorious one) and the contextual freedom from a formal register, the Internet contains very heterogeneous expressions, which makes the medium highly interesting from a linguistic point of view.
It is very important, particularly when a search returns few results, to check on which site the information is found, so as to attribute different reliability values to different sites. If I'm looking for the scientific name of a plant, for example, a university site or the site of a botanical garden is clearly of greater value to me than a non-professional site on which ordinary citizens share experiences and advice about gardening. If I'm looking for the official name of an institution, I will obviously give particular weight to the site of another institution, all the more so if it belongs to the same country as the institution I'm looking for information about.
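The idea of attributing different reliability values to different sites can be sketched in a few lines of Python. This is only an illustration: the weights, the kinds of site and the example.* addresses are hypothetical and do not come from any real search; in practice the searcher judges each site by looking at it.

```python
from urllib.parse import urlparse

# Hypothetical reliability weights by kind of site; the values are illustrative only.
WEIGHTS = {"university": 3, "botanical garden": 3, "hobby forum": 1}

# Hypothetical hits for a plant-name search: (URL, kind of site as judged by the searcher).
hits = [
    ("http://forum.example.com/gardening/thread-12", "hobby forum"),
    ("http://botany.example.edu/species", "university"),
    ("http://garden.example.org/catalogue", "botanical garden"),
]

def rank_by_reliability(hits):
    """Return host names ordered from most to least reliable (ties keep their order)."""
    ordered = sorted(hits, key=lambda hit: WEIGHTS[hit[1]], reverse=True)
    return [urlparse(url).hostname for url, _ in ordered]

print(rank_by_reliability(hits))
```

With these made-up weights the university and botanical garden sites come before the gardening forum, which mirrors the judgment described above.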
The spontaneous way in which this corpus is formed also gives the Internet features that can be used to the advantage of those searching it. Let us suppose, for example, that I am in doubt about what a preposition governs, for example the Italian word "vicino". I try entering the word in the search engine.
The first thing I notice looking at the results is that I need to limit my search to sites in Italian, because the first sites I find are in English and from the U.S., and are therefore of no interest for my current search. I click on "language tools", choose Italian-language sites and run the search again.
My next problem is that I find many occurrences of this word as an adjective or adverb, but in this case they are of no interest to me. How can I eliminate such occurrences from my search?
My doubt arises from the fact that, in expressions of place, I have found the word "vicino" both followed by "a" and followed directly by the noun: for example, both "vicino Perugia" and "vicino a Perugia". I want to find out which is more widely used. I therefore enter in the search engine the complete string with a town name, in quotes, which in Google's syntax means "words in this exact order". Or, if I don't remember that I have to use quotes, I click on "Advanced search" and write my string in the box labelled "Find results with the exact phrase". Here it is:
«vicino a Perugia» gives 307 results
«vicino Perugia» gives 277.
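For readers who prefer to see the exact-phrase mechanism spelled out, here is a minimal Python sketch of how the two quoted strings can be turned into search addresses. It assumes that the search page accepts the query text in a parameter called "q"; the helper names exact_phrase and search_url are invented for this illustration, and the sketch only builds the addresses without contacting the search engine.

```python
from urllib.parse import urlencode

# Assumption: the search page takes the query text in a "q" parameter.
SEARCH_URL = "http://www.google.com/search"

def exact_phrase(words):
    """Wrap a string in double quotes, i.e. "words in this exact order"."""
    # Collapse accidental double spaces before quoting.
    return '"' + " ".join(words.split()) + '"'

def search_url(words):
    """Build the address of an exact-phrase search for the given string."""
    return SEARCH_URL + "?" + urlencode({"q": exact_phrase(words)})

for phrase in ("vicino a Perugia", "vicino Perugia"):
    print(search_url(phrase))
```

The quotes are what matter here: without them the engine would look for the words anywhere on the page, and the two variants could not be told apart.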
From this result I could apparently infer that the two versions are used interchangeably. Still, when visiting Lombardy, you almost never hear the word "vicino" used as a preposition in place of the prepositional phrase "vicino a". This raises the suspicion that there are local peculiarities of usage. To check my idea, I formulate a hypothesis: the expression "near town X" is more likely to be used by people living near town X. To test this hypothesis I enter four different strings in the search box:
vicino a Milano
vicino Milano
vicino a Roma
vicino Roma
Here are the results:
«vicino a Milano» 3110 results
«vicino Milano» 1580 results
«vicino a Roma» 1750 results
«vicino Roma» 3230 results
From these results some points are evident. The first is that in the Milan area, as I suspected, the phrase "vicino a" prevails. The second is that in the Rome area the situation is reversed: the use of "vicino" alone as a preposition of place prevails. The first result, the one with "Perugia", was a false parity, probably because the Perugia area is under roughly equal influence from the North and from Rome.
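The comparison can be made explicit by computing, for each town, the share of hits that use "vicino a". The following Python sketch uses only the counts reported above; the 10-point margin that separates a "clear preference" from a near parity is an arbitrary assumption chosen for illustration, not a statistical test.

```python
# Hit counts reported in the text for the quoted strings.
counts = {
    "Perugia": {"vicino a": 307, "vicino": 277},
    "Milano": {"vicino a": 3110, "vicino": 1580},
    "Roma": {"vicino a": 1750, "vicino": 3230},
}

def share_with_a(town_counts):
    """Fraction of the hits for a town that use the form with "a"."""
    total = town_counts["vicino a"] + town_counts["vicino"]
    return town_counts["vicino a"] / total

def prevailing_form(town_counts, margin=0.10):
    """Name the prevailing form; margin is an arbitrary, illustrative threshold."""
    share = share_with_a(town_counts)
    if share > 0.5 + margin:
        return "vicino a"
    if share < 0.5 - margin:
        return "vicino"
    return "no clear preference"

for town, town_counts in counts.items():
    print(f"{town}: {share_with_a(town_counts):.0%} with 'a' -> {prevailing_form(town_counts)}")
```

With these figures roughly two thirds of the Milan hits use "vicino a", roughly two thirds of the Rome hits use "vicino" alone, and Perugia sits close to an even split, which is the pattern described above.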
Such information is much richer than what I can find in a dictionary (which tells me that the preposition "vicino" doesn't exist and that there is only the prepositional phrase "vicino a"). Now I know that in the Rome area (a look at the sites returned by the search "vicino Roma" confirms that they are mostly in the Rome area) people make conspicuous use of this expression. The answer from a dictionary is not necessarily enough for a translator: sometimes she needs to know whether a given expression is typical of certain areas and, if possible, of which ones.
In the next unit we'll go on examining the potential of search engines.
Bibliographical references
FREUD SIGMUND, L'interpretazione dei sogni, in Opere, vol. 3, Torino, Boringhieri, a cura di C. L. Musatti, 1966.
FREUD SIGMUND, The Interpretation Of Dreams, translated by A. A. Brill, London, G. Allen & company, 1913.
GOOGLE, available on the World Wide Web at http://www.google.com/, consulted 7 April 2004.
1 Freud 1900: 46.