Corpus linguistics applied to foreign language learning and teaching
A new and effective way to study a foreign language in context and in use. Corpora may be exploited individually or in class, and they let you learn a language whenever and wherever you wish, thanks to omnipresent Wi-Fi.
Curated by kukuwka

Vance Stevens: Studying vocabulary using concordances on microcomputers

kukuwka's insight:

It's great that we can find so many works on using corpus studies and concordances in a language class. But they are all in English. That means the additional difficulty of code-switching, of making the content and linguistic phenomena fit the objectives of teaching another foreign language. Why isn't there any evidence in French, for example?


Guy Aston: Corpus use and learning to translate

Corpus use and learning to translate

Guy Aston

1999. Textus 12: 289-314.

Given the fact that total bi-directional correspondences are extremely rare
phenomena, we often have to search for second-best matches, and that means we
have to select one of several alternatives, namely the one that fits the context best.
[...] It should be possible to come up with this match if the translator consults a large
corpus [...] and, by identifying the context pattern in question, finds the lexical unit
that would `naturally' be used in such a situation. All it needs is an operational
definition of context and context pattern.
(Teubert 1996: 241)
0. Introduction

A paper on the learning of translation must espouse some view of what translation involves. So let me state some premises. Following Joseph (1998), I take it that translation involves interpreting a source text (ST), and then generating a target text (TT) in another language which strategically directs its intended audience to an interpretation of it - generally one which in certain respects matches the interpretation given to the source text. From this point of view - substantially corresponding, as I understand it, to Nida's notion of "dynamic equivalence" (1964) - would-be translators must develop interpretative and strategic competencies which they may well lack in at least one of the languages involved for a particular task, since translators are rarely balanced bilinguals, nor always specialists in the discourse domain in question. In addition, translating - like editing - calls for the ability to elaborate, compare and evaluate different strategies and interpretations in the light of externally-defined contextual restrictions. Translators typically work under commission, where specific target audiences, and specific interpretations of the source and/or of the target text are implied (Reiss 1981).

The translator thus needs resources which can suggest possible and probable interpretations of the ST, which can indicate effective strategies for achieving particular interpretations of the TT, and which can facilitate the evaluation of alternative strategies and interpretations. Varantola (1997) suggests that as much as 50% of the time spent on a translation can be dedicated to consulting reference materials. In this paper I review the roles which can be played by electronic corpora in improving the quality and speed of the translation process, in helping would-be translators to develop their interpretative and strategic competence, and in developing their sensitivity to the issues involved. While in no way wishing to suggest that electronic corpora are a touchstone to resolve the translator's many problems, I believe that they can satisfy three significant criteria for translation instruments:

if they facilitate the process and lead to a better quality product,
also by widening the choices available to the user; if they offer
opportunities for linguistic and metalinguistic learning; if they
allow the development of a technical and critical ability with
respect to such tools.
(Aston 1996: 308)

They achieve these objectives by providing "collections of helpful information which facilitate their decision-making and make them feel more secure about their choices" (Varantola 1997), allowing better and/or faster solutions to be obtained; by offering numerous opportunities for learning the language, the domain, and about the translation process; and by allowing the user to play an active role in their development and exploitation.
1. Types of corpora

Interest in corpora in the field of translation has been from two main perspectives, descriptive and practical. On the one hand, scholars have designed and analysed corpora of translations, comparing these with corpora of original texts in order to establish the characteristics peculiar to translations in particular SL-TL combinations (Gellerstam 1996), and indeed possible universals distinguishing translated texts (Baker 1998, Laviosa 1998). On the other hand, there has been a growing interest in corpora as aids in the processes of human and machine translation - the role which is my primary concern here. For this purpose, three main types of corpora have been proposed as relevant:

Monolingual corpora consist of texts in a single language, which may be either the source or the target language of a given translation. While general monolingual corpora include texts of a wide variety of types, specialized monolingual corpora are restricted to a particular genre and/or topic domain. In either case, the corpus attempts to provide a sample of a particular textual population, which ideally also reflects the variability of that population (Biber 1993).
Where monolingual corpora of similar design are available for two or more languages, they may be treated as components of a single comparable corpus. With a few exceptions (note 1), comparable corpora are currently specialized, with the texts belonging to genres or domains which are sociolinguistically similar in each of the cultures involved (in terms of participation framework, function, and topic), and have similar variabilities.
Parallel corpora also have components in two or more languages, consisting of original texts and their translations. Again, most parallel corpora are specialized. They take two main forms (figure 1):
Figure 1: Comparable and parallel corpora

                 language A                 language B

Comparable       A. specialized corpus      B. specialized corpus of same design

Parallel         A. specialized corpus      B. translations of texts contained in A

Parallel         A1. specialized corpus     B1. specialized corpus of same design as A1
Bidirectional    A2. translations of B1     B2. translations of A1
Unidirectional parallel corpora consist of texts in one language along with translations of those texts into another language (or languages). Since the corpus in language A is by definition restricted to texts which have been translated into language B, this will not generally allow the textual population in language A to be representatively sampled (Aijmer et al 1996). The criteria to be adopted in selecting the translations to be included in the language B component are also debatable - for instance, whether these should be filtered for "quality" in some way. The two components are typically aligned on a paragraph-by-paragraph or sentence-by-sentence basis: that is to say, information is added to each sentence or paragraph of each text which indicates the corresponding sentence or paragraph in the parallel text in the other component (note 2). (For a review of alignment procedures, see
Bidirectional or reciprocal parallel corpora contain four components: source texts in language A and their aligned translations in language B, and source texts in language B and their aligned translations in language A. They thereby combine the characteristics of unidirectional parallel corpora with those of comparable corpora: if the same design criteria are employed for both languages, they include comparable collections of original texts in the two languages (A1 and B1), as well as comparable collections of translated texts in the two languages (A2 and B2). They additionally allow comparisons between original and translated texts in the same language (A1 and A2; B1 and B2: Johansson and Ebeling 1996).
In this paper I discuss the relevance of each of these types of corpus for the trainee translator. In addition, I shall consider the role of ad hoc corpora, i.e. corpora compiled "on the fly" by the translator in order to investigate a specific problem encountered during a particular translation.
2. Uses of different corpora

2.1 Monolingual general corpora

The obvious way in which corpora can help translators is as reference tools, as complements to traditional dictionaries and grammars. Thus the first sentence of Bruce Chatwin's Utz (1989a: 7) reads as follows:

An hour before dawn on March 7th 1974, Kaspar Joachim Utz died of a
second and long-expected stroke, in his apartment at No. 5 Sirok Street,
overlooking the Old Jewish Cemetery in Prague.
Let me focus on just one problem here, the translation of overlooking. If we examine the occasions where the word apartment occurs in the vicinity of overlooking in the 100-million word British National Corpus, we find that apartments typically overlook mountains, rivers, oceans, ports, squares and gardens - all views which seem positively connotated. On the few occasions where what is overlooked is ugly, irony appears to be intended, as in:
Not only do they tolerate the fast-food shops serving up nutriment that top breeders
wouldn't recommend for Fido, they go as far as purchasing two expensive weeks in
a gruesome timeshare apartment, and sit smoking all day on a balcony
overlooking the A9.
The corpus data thus suggests that overlooking has a positive semantic prosody (Sinclair 1991, Louw 1993) - a fact which is unmentioned by dictionaries, and might even be overlooked by a translator whose native language was English. It aids interpretation of the ST, raising the problem of whether Chatwin intends the Prague cemetery to be seen by the reader as a beauty spot, or whether he is being ironic - or indeed, whether he simply aims at ambiguity in this respect.
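The kind of search described above, looking at what occurs in the vicinity of overlooking, can be sketched in a few lines of Python. This is only a minimal illustration and not the software used with the BNC: the function name collocates, the default five-word span and the sample sentences in the usage note are assumptions introduced here, and the tokenization is deliberately crude.

```python
import re
from collections import Counter


def collocates(text, node, span=5, stopwords=frozenset()):
    """Count the words found within `span` tokens either side of `node`.

    Hypothetical helper: splits on non-word characters and lowercases
    everything, so it ignores sentence boundaries and part of speech.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node.lower():
            # Tokens to the left and to the right of this occurrence.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w not in stopwords)
    return counts
```

Called on a few invented sentences such as "An apartment overlooking the sea. A balcony overlooking the garden." with node "overlooking", the counter lists the nouns that tend to be overlooked. Judging whether those nouns are positively connotated remains a task for the reader.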
A corpus can also help the translator evaluate - or indeed come up with - a possible translation for this sentence. The Italian translation of Utz (Chatwin 1989b: 9) renders it as:

Il 7 marzo 1974, un'ora prima dell'alba, nel suo appartamento di via Sirok 5 che
dava sul vecchio cimitero ebraico di Praga, Kaspar Utz morì di un secondo
colpo da tempo previsto.
Does the choice of dava su share the positive connotations of overlooking, and allow a similar, possibly ironic, interpretation? In a small (2 million word) collection of Italian literary texts, we find the following instances of dava su:
Lei non si vedeva. Ma il soggiorno dava su una veranda da cui una scaletta
o negli onesti. La finestra di mezzo dava su un balcone di ferro. Concentr
finestrone, dai vetri impolverati, che dava su di uno spalto esterno, da cui si
la vasca. Al chiaro di una finestra che dava su un cortile interno, le sensazio
be potuto uscire subito dalla porta che dava sul sottoponte. Ma, quasi a prende
Muovendosi davanti alla vetrata che dava sul parco, il Bocchi vide i globi
omandante aveva una grande finestra che dava sul pozzo a lume; di fronte, con un
Bocchi abitava in un piccolo attico che dava sul Lungoparma, nel punto in cui
These citations offer little evidence that dava su has a distinctive prosody, and make it doubtful that this translation could be interpreted as ironic.
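A concordance like the one above for dava su can be produced with very little code. The following is a minimal Python sketch, assuming the corpus is available as a single plain-text string; the function name kwic and the width parameter are hypothetical, and real concordancers add sorting, tagging and corpus indexing that are not attempted here.

```python
import re


def kwic(text, node, width=40):
    """Return keyword-in-context lines for every occurrence of `node`.

    Hypothetical helper: each line shows up to `width` characters of
    left context, the matched expression, and up to `width` characters
    of right context, with the expression aligned in a fixed column.
    """
    # Collapse whitespace so that line breaks do not split the contexts.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so that the node starts at column `width`.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines
```

Printing the returned lines one per row gives the familiar layout in which the search expression runs down the middle of the page, which makes recurrent patterns to its left and right easier to spot.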
Data from monolingual corpora may thus support interpretative and strategic hypotheses, or suggest that they should be rejected. They may also suggest alternative hypotheses. In the English corpus, overlooking tends to be associated with a particular set of collocates (garden, sea, hills, square etc.). If we search the Italian corpus for occurrences of equivalents to these collocates (giardino, mare, montagna, piazza, etc.) in the vicinity of words like appartamento, camera, casa and finestra, we find citations such as the following:

se vuole posso prenotarle una camera per domani stesso, una bella e linda cameretta con vista sul
mare, vita sana, bagni di alghe, talassoterapia,
This citation suggests another possible translation of overlooking, namely con vista su. As we did with dava su, we can now test this against the corpus in order to see whether it is positively connotated, and whether there is evidence of its being used ironically - whether, that is, it occurs in similar contexts to overlooking.
A monolingual general corpus also provides a rich language learning environment. Even if the dava su hypothesis is rejected, the process of doing so allows the user to learn much which may be of value in the future. Unlike the dictionary, a concordance leaves it to the user to work out how an expression is used from the data. This typically calls for deeper processing than does consulting a dictionary, thereby increasing the probability of learning (Hulstijn 1992). In more general terms, by drawing attention to the different ways expressions are typically used and with what frequencies, corpora can make learners more sensitive to issues of phraseology, register and frequency, which are poorly documented by other tools.

Corpora also allow much unpredictable, incidental learning. Almost any concordance is likely to contain unknown or unfamiliar uses, which may be noticed and explored by the user who is prepared to go off at a tangent to follow them up (Bernardini 1997, in press). Looking through the occurrences of dava su, I noticed the unfamiliar expression pozzo a lume. While I can roughly understand its meaning from the context, I may be able to get a better idea of its use and frequency by generating a concordance of all its occurrences in the corpus.

As translation aids, however, monolingual general corpora pose a number of difficulties:

It may be difficult to locate and select an appropriate corpus. Reference corpora such as the Bank of English and the British National Corpus are sufficiently large and well-balanced to document the range of uses of all but the rarest lexical items in British English: there are, for example, 767 occurrences of overlooking in the BNC. But no similar corpus is yet publicly available for Italian, nor for American English. The Italian data cited above were taken from a relatively small (2 million word) collection of contemporary literary texts, put together from the Internet. The limited size and representativeness of this collection makes it much more difficult to identify and to evaluate regularities of use: there are only 8 occurrences of dava su, and it is debatable how far the intertextual background against which a translation of Utz should be interpreted is a purely literary one.
It may be difficult to retrieve appropriate instances from the corpus. Overlooking, for example, is polysemous, meaning either "looking out onto" or "ignoring". There is no way, using currently available corpora and concordancing software, that it is possible to find just one of these senses and exclude the other. Roughly 10% of the occurrences of overlooking in the BNC have the "ignoring" sense, and these must be excluded manually in order to effectively analyse the semantic prosody of the "looking out onto" sense. In the case of dava su, not only do other senses (such as that of "gave up") have to be excluded, but morphological variants of each word should probably be investigated (da/dava/danno/davano su/sul/sull'/sullo/sulla/sugli/sulle), many of them polysemic, as should occurrences where the component words are separated by, for example, adverbials (dava direttamente su). Considering such factors will tend to reduce the precision of any search, making it more likely that "spurious" solutions will be found which require manual deletion (note 3).
It may be difficult to match the data to the translation. Whether evaluating a usage in the ST, or a candidate translation in the TT, the user is unlikely to find examples which precisely match the required context. Analysing concordances requires identification, classification and generalization to establish recurrent patterns and to relate these to particular contextual features (Johns 1991: 4), and these procedures require training and practice. Faced with a concordance of overlooking, the learner will need to group uses with different senses (e.g. animate and inanimate subjects and/or physical and abstract objects: overlooking the problem vs overlooking the park), and to draw inferences as to what features are shared by a particular group. Since this obliges the user to discriminate and attend to uses which differ from that occurring in the source or target text, the process will be time-consuming, and arguably dispersive in terms of the translation at hand - even if rewarding in language learning terms, where greater understanding of the different uses of overlooking may be a valuable by-product.
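The retrieval problem raised above for dava su, with its morphological variants and possible intervening adverbials, can be approximated with a regular expression. The Python sketch below is an assumption-laden illustration: the names find_variants and max_gap are invented here, the verb and preposition lists simply follow those given in the text (with the accented form dà added), and the pattern makes no attempt to separate senses, so spurious hits still need manual deletion.

```python
import re

# Forms listed in the text; "dà" is added here as the accented spelling.
VERB = r"(?:dà|dava|danno|davano|da)"
# "sull'" attaches directly to the next word, so it takes no word boundary.
PREP = r"(?:sull'|(?:sullo|sulla|sulle|sugli|sul|su)\b)"


def find_variants(text, max_gap=1):
    """Return matches of a verb form followed by a preposition form.

    Up to `max_gap` intervening words (e.g. an adverbial such as
    "direttamente") are allowed between the two. Hypothetical helper:
    it over-generates, since "da" is also a common preposition.
    """
    pattern = re.compile(
        rf"\b{VERB}\b(?:\s+\w+){{0,{max_gap}}}\s+{PREP}",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]
```

Raising max_gap improves recall at the cost of precision, which is the trade-off the paragraph above describes.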
2.2 Specialized and comparable corpora

These difficulties can be reduced by using corpora which are specialized, that is, which consist only of texts of a type similar to the ST and/or the desired TT. Such corpora may be extractable as sub-corpora from large general ones - though only limited specialization can be obtained without compromising representativeness (Sinclair 1991) - or they may be specifically collected - an investment which may be well worth the effort where the translator foresees doing a number of similar translations in the future, and which is therefore a useful exercise for any translator training course (Maia 1997).

Specialized corpora can be seen as a development of the tradition of using "parallel" texts in translation - i.e. collections of texts of the same kind as the ST and/or TT (Haartman 1980; Williams 1996) - with electronic format enabling more rapid and systematic searching of larger quantities of text. Such corpora are particularly useful for the investigation of forms and meanings which are typical of that type of text (in particular terminology, but also features of register and text structure: Gavioli, forthcoming; Zanettin, forthcoming), and as an environment in which to prepare for work which has to be carried out under time constraints, such as speech interpreting. Varantola (1997) underlines how specialized corpora have high "reassurance value", particularly where the TT is in the translator's L2, insofar as they illustrate similar contexts to those of the translation being worked on.

Where specialized corpora have to be constructed by the user, this involves design decisions as to what texts to include and why. One of our early experiences in Forlì with specialized corpora involved learners who were translating material for the Melozzo centenary exhibition into English, for which we compiled English and Italian corpora from CD-ROMs of the National Gallery and of the Uffizi. Each corpus contained texts of similar types describing artists and their works, genres, schools and technical aspects for a lay public. While limited in size (under 100,000 words each), their specialization and authoritativeness made them appropriate resources for the task, and given their similar composition, the two corpora could also be treated as comparable. Today, corpora of texts of this type could also be compiled from the Internet (Pearson, forthcoming). Clients are also a potential source of relevant specialized texts.

With respect to general monolingual corpora, specialized ones are easier to handle and in many ways more informative. In particular:

It is easy for the user to become familiar with the texts included in a small specialized corpus, facilitating the interrogation of that corpus and the interpretation of data from it (Aston 1997) - a familiarity which will be further enhanced if the corpus has been compiled by the user (Maia 1997).
A specialized corpus can provide figures concerning lexical density and repetition for texts of that type, in the shape of standardized type/token ratios, numbers of types accounting for different percentages of tokens, etc. The user can compare means and variances of these figures with values for the ST and/or TT, to see how well the latter match the norms of the corpus.
Concordances are less likely to contain spurious citations. Insofar as the frequency of different senses varies according to text-type, the likelihood of encountering other senses of polysemous items may be reduced (if we simply restrict a search for overlooking to the subcorpus of fiction in the BNC, the proportion of examples of the "ignore" sense drops by 50%). Gavioli and Zanettin (1997) provide a clear example of this phenomenon. Faced with the ST phrase etilisti con o senza marcatori HBV in a medical research article on hepatitis C, the initial translation hypothesis was alcoholics with or without HBV markers. However a search for markers in a specialized corpus of similar articles in English revealed the recurrent positive/negative for HBV markers. This would not have emerged from the BNC, where there are fewer texts of this type, and where other senses of the word markers predominate (in reference to examinations, pens, and linguistics). The viability of such a corpus, however, depends on whether the intertextual background for the use of an expression can be confidently limited to a single text-type in this way.
Specialized corpora also provide more assistance in formulating translation hypotheses. The greater precision provided by a specialized corpus allows us to extend the principle of using collocates to identify possible equivalents to complete texts or segments of texts. Where an expression in the ST is used in relation to a particular person or concept in the field in question, it may be possible to locate possible equivalents by searching for references to that same person or concept in the TL corpus and then reading the surrounding text. Software can facilitate this process: Wordsmith Tools (Scott 1997) shows which texts and parts of those texts contain most occurrences of a particular form. Remaining within the field of hepatitis C research, let us say that we wish to find an appropriate translation for the ST's casi mortali per insufficienza epatica. Given that all the texts in the corpus deal with hepatitis C, we can guess that those where death is most frequently mentioned are most likely to be relevant. The following table shows the files where the word death is most frequent:
N   File    Words   Hits   Per 1,000 words
1   sx.11   1,436   17     11.84
2   rx.11     965   11     11.40
3   mx.11     914    8      8.75
4   rx.7      870    3      3.45
Reading the first of these, we find the expression fatal liver disease - a translation hypothesis which we can then investigate using the entire corpus.
Specialized corpora facilitate analyses related to textual macrostructure. Insofar as the texts in the corpus share similar structures and functions, it is easier to relate occurrences to particular functions and positions in texts (Aston 1997).
Incidental learning from texts and citations from a specialized corpus is more likely to be relevant to the task at hand, or at any rate to come in useful for further translations of a similar nature (Zanettin, forthcoming). For instance, a concordance of panel painting in the National Gallery corpus includes references to types of panel paintings, such as tondi, and to the techniques whereby they were created, such as pastiglia - terms which may well prove useful at other points of an art history translation.
A specialized corpus provides a useful means of learning about an area in which the translator needs to work and its textual conventions. Key concepts can be located manually in wordlists, or a wordlist from a specialized corpus can be compared with one from a general corpus in order to highlight the distinctive features of the former (Wordsmith Tools carries out such comparisons for both single words and phrases). If the corpus is comparable, a candidate list of terms in one language can be matched with one for the other language to create a terminology bank.
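The ranking of files by occurrences of death shown in the table above rests on a simple normalization: hits divided by text length, times 1,000. The Python sketch below illustrates that calculation only. It is not how Wordsmith Tools is implemented, and the function names per_thousand and rank_files, as well as the dictionary-of-texts input, are assumptions made for the example.

```python
import re


def per_thousand(hits, words):
    """Normalize a raw count to occurrences per 1,000 words."""
    return 1000 * hits / words


def rank_files(texts, word):
    """Rank texts by how densely `word` occurs in them.

    `texts` maps a file name to its content (a hypothetical layout).
    Returns (name, words, hits, hits per 1,000 words) tuples, densest
    first, leaving out files where the word does not occur at all.
    """
    rows = []
    for name, text in texts.items():
        tokens = re.findall(r"\w+", text.lower())
        hits = tokens.count(word.lower())
        if hits:
            rows.append((name, len(tokens), hits, per_thousand(hits, len(tokens))))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Applied to the figures in the table, per_thousand(17, 1436) rounds to 11.84 and per_thousand(3, 870) to 3.45, matching the values reported there.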
While most work involving specialized corpora as translation aids has used TL corpora (Bowker 1998, Varantola 1997, Friedbichler and Friedbichler 1997), where comparable specialized corpora are available, these can also be used to investigate the SL and the ST, particularly where the conventions of the latter are relatively unfamiliar, as a means to identify routine and non-routine uses. Comparable corpora seem particularly useful for learning purposes, as a means of exploring a particular text-type in both languages prior to engaging in translation.

2.3 Corpus construction

Since specialized corpora for a particular text-type are rarely available off-the-shelf, the translator needs to learn to construct such corpora - an experience which will develop awareness of their potential validity and reliability. Collecting a reasonably representative set of texts of a particular type requires a preliminary survey of the textual population and of its variability, as well as of the authoritativeness of candidate texts. Friedbichler and Friedbichler (1997) recommend selecting texts which have been subject to peer review, and which are where possible widely cited in the specialist literature (note 4); Varantola (1997) recommends avoiding texts written by non-native speakers.

It is clear that for any specialized corpus, the greater the variability of the text-type to be represented, the larger the corpus should be. In general, the larger the better, though there is clearly a point where the returns on expansion diminish. Friedbichler and Friedbichler (1997) suggest that for English, authoritative specialized corpora of 500,000 to 5 million words (according to the variability of the text-type) should provide solutions to 97% of the translator's questions. In what follows, a number of criteria for evaluating specialized corpora are proposed: in each case, the smaller the value the better.

The smaller the type/token ratio, the more lexically repetitive the corpus, and hence the better documented the types it contains. A ratio of 2% means that each word-type occurs, on average, 50 times every 1000 words in the corpus.
While indicating the extent of documentation of the types contained in the corpus, the type/token ratio gives no indication of whether those occurring in a similar text from outside the corpus will be documented. This probability can be assessed from the ratio of hapax legomena (word-types which occur only once in the corpus) to the total number of tokens: an HL/tokens ratio of 2% means that when reading a new text, an undocumented type is likely to be encountered, on average, every 50 words.
The HL/tokens ratio does not however consider the variability of the text-type. This can be assessed by considering the proportion of word-types that occur in only one text in the corpus. This provides a further indication of the likelihood, in any new text, of encountering new types. A proportion of 20% means that in any similar text, 20% of its word-types will on average be undocumented.
All these measures are a function of variability within and across texts, and of corpus size (and in the case of the last measure, also of text size): a small but homogeneous corpus of weather reports may well have lower values than a much larger one of tourist guides. Values will also depend on the language of the texts: given the greater morphological complexity of the language, Italian corpora tend to have higher values than English ones (note 5).
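The three measures just described can be computed directly from a collection of texts. The following Python sketch is one possible reading of them, under stated assumptions: tokens are lowercased runs of word characters, the type/token ratio is the raw ratio over the whole corpus rather than a standardized one, and the names corpus_measures and undocumented_share are hypothetical. The second function gives the actual proportion of undocumented types in a new text, for comparison with the corpus values.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer: lowercased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def corpus_measures(texts):
    """Compute the three ratios for a list of texts (one string each)."""
    per_text = [tokenize(t) for t in texts]
    freq = Counter(tok for toks in per_text for tok in toks)
    # Number of texts in which each type occurs at least once.
    doc_freq = Counter(typ for toks in per_text for typ in set(toks))
    n_tokens = sum(freq.values())
    if n_tokens == 0:
        raise ValueError("corpus contains no tokens")
    n_types = len(freq)
    hapax = sum(1 for count in freq.values() if count == 1)
    single_text = sum(1 for count in doc_freq.values() if count == 1)
    return {
        "type_token_ratio": n_types / n_tokens,
        "hapax_token_ratio": hapax / n_tokens,
        "single_text_type_share": single_text / n_types,
    }


def undocumented_share(texts, new_text):
    """Share of the word-types in `new_text` that the corpus never attests."""
    known = {tok for t in texts for tok in tokenize(t)}
    new_types = set(tokenize(new_text))
    if not new_types:
        raise ValueError("new text contains no tokens")
    return len(new_types - known) / len(new_types)
```

Because inflected forms count as separate types here, a morphologically richer language will score higher on all three ratios unless the texts are lemmatized first.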

The translator can use measures such as these to assess the reliability of a particular specialized corpus and hence to determine its required size. Values obtained on the last two measures can also be compared with the actual proportions of undocumented types encountered in the ST and/or TT, as an indication of the "goodness-of-fit" of the corpus for the text in question.

This fit will rarely be perfect, and in any case no specialized corpus is ever likely to document all the problems posed by a particular text. Specialized texts also use non-specialized language, and the intertextual background on which they draw will rarely be simply that of the text-type in question. There thus remains the need to recognize where general monolingual corpora should be called on, or where it may be useful to compile a corpus ad hoc to analyse a particular problem.

2.4 Ad hoc corpora

Specialized corpora will rarely document every word in an ST or TT, even if they are likely to provide a much fuller documentation for features typical of that text type than large general ones. One learner using a comparable specialized corpus on cancer of the colon in order to translate an English research article into Italian was completely nonplussed by an allusion in the ST to the holy plane, for which she could find no explanation or equivalent. In such cases, relevant information may be obtainable from a large monolingual corpus or, failing that, CD-ROMs or the Internet. We can in fact use the Internet to compile corpora ad hoc, using search engines to find all the texts containing particular expressions. Since the world-wide web is an ever-changing entity of dubious authority whose overall composition is unknown, considerable care must however be exercised in selecting texts and drawing inferences (Pearson, forthcoming).

The value of such ad hoc corpora can be illustrated by an example from Bertaccini and Aston (forthcoming), which focusses on the translation into English of a French newspaper article which contained the word clochemerlesques. Searches were made for clochemerl* in a CD-ROM of Le Monde, and using the Altavista search engine on the Internet. Together, these turned up 20 French texts, analysis of which allowed for a fairly confident interpretation of the ST: Clochemerle was a comic novel by G. Chevallier which ridiculed factionism in village politics, apparently well-enough known as an archetype of petty factionism to be alluded to without explanation by French journalists.

How could it be translated in English? Searches for English examples of clochemerl* on the Internet, and in CD-ROMs of The Independent and The Daily Telegraph, suggested that Clochemerle was far from equally familiar to a British public, and that it was if anything associated with public conveniences. Did any archetype in British culture have similar associations to the French one? One possibility which came to mind was Gulliver's Travels, and the conflict in Lilliput between Big- and Little-endians as to the right way to crack an egg. However, further searches provided no evidence that reference to Lilliput, or to big/little-endian, would have these associations for a general reader (the former seemed associated exclusively with size, and the latter were terms in computer architecture). The final (if less than fully satisfactory) solution was local squabbling, whose derogatory connotations were confirmed by a study of the semantic prosody of squabbl* in the BNC.

In such cases, an ad hoc corpus is clearly better than none, though very time-consuming to compile. Friedbichler and Friedbichler (1997) suggest that to be cost-effective, searches using corpora should not exceed an average of ten seconds: so the use of ad hoc corpora must be limited to a very small proportion of the problems posed by any translation.

2.5 Parallel corpora

A further limit of monolingual and comparable corpora as translation tools is the difficulty of generating hypotheses as to possible translations. The user must rely on known or suspected equivalences as heuristics to retrieve similar contexts in a TL corpus, providing a specification which is both sufficiently general to recall a range of possibilities, and sufficiently precise to limit the number of spurious hits. S/he must then verify that the citations retrieved are in fact sufficiently similar to those of the ST and/or the SL corpus. These procedures are both time-consuming and error-prone: an expression in the TL corpus may occur in a similar context to one in the SL corpus, yet in fact mean something different. For example, in attempting to translate the phrase loop ileostomy in a medical research article, Ferri (1999: 64) illustrates how a search for similar contexts in the TL found ileostomia su bacchetta. Without detailed medical knowledge, she initially assumed this term to be equivalent, while it is in fact hyponymous.

Greater certainty as to the equivalence of particular expressions can be obtained by using parallel corpora, consisting of original texts and their translations, where these are similar to the ST and TT. If the corpus is aligned, and suitable software is available, the user can locate all the occurrences of any expression along with the corresponding sentences in the other language.
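
As a rough illustration of the lookup just described, the following Python sketch searches a sentence-aligned corpus held as a list of (English, Italian) pairs. The data structure, the function names wildcard_to_regex and parallel_concordance, and the toy sentences are assumptions made for illustration only; they do not describe how any particular parallel concordancer works internally.

```python
import re

def wildcard_to_regex(query):
    """Turn a query such as 'establish*' into a whole-word regex."""
    escaped = re.escape(query).replace(r"\*", r"\w*")
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)

def parallel_concordance(pairs, query):
    """Return every aligned pair whose first member matches the query."""
    pattern = wildcard_to_regex(query)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]

# Toy aligned corpus (hypothetical data in a parliamentary style).
PAIRS = [
    ("We need to establish a coherent policy.",
     "E' necessario realizzare una politica coerente."),
    ("It is vital that we establish diplomatic relations.",
     "Si rivela indispensabile instaurare relazioni diplomatiche."),
    ("We welcome the report.",
     "Accogliamo con favore la relazione."),
]

for english, italian in parallel_concordance(PAIRS, "establish*"):
    print(english)
    print("    " + italian)
```

The sketch presupposes that alignment has already been done; in practice aligning the two versions sentence by sentence is the laborious step.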

There is however a dearth of parallel corpora for English and Italian, and relatively little parallel concordancing software for the PC (though see Barlow 1995, Woolls 1997). The examples which follow were extracted using Multiconcord (Woolls 1997), from its sample collection of different language versions of discussions in the European Parliament. This material has many limits, since we do not know which version constitutes the original text, and which a translation, or indeed a translation of a translation (Lauridsen 1996). Nevertheless, it can illustrate how a parallel corpus may provide a means of identifying translation hypotheses in a specialized environment.

The following concordance shows occurrences of the word establish and its equivalents in Italian (some citations are abbreviated for reasons of space):

We support the Socialist Group's demand for the President to establish a
committee as soon as possible to conduct such a review.
Condividiamo la richiesta del gruppo socialista in base alla quale il Presidente dovrebbe
istituire quanto prima un comitato per la realizzazione di questa

if we are to guarantee the quality and competitiveness of the European tourist
industry, we shall have also to develop new forms of synergy with other Community
policies, bringing in all of the interested parties in an effort to establish
the conditions favourable to the development of the Union's tourist enterprises
per garantire la qualita' e la competitivita' dell'industria europea del turismo,
occorre inoltre sviluppare nuove sinergie con le altre politiche comunitarie,
coinvolgendo tutte le parti interessate al fine di creare le condizioni
favorevoli allo sviluppo delle imprese turistiche dell'Unione

Thus we need to establish a coherent European tourism policy which
adds value above and beyond Member State level and against which we can judge
and monitor the very considerable sums of money which are spent through other EU
funds
ed e' quindi necessario realizzare una politica europea per
il turismo globale, che aggiunga valore al di sopra ed oltre il livello di Stato Membro
e rispetto alla quale possiamo valutare e controllare le notevoli somme di denaro che
vengono spese attraverso altri fondi europei

It is vital at this point that we establish diplomatic relations and
therefore a dialogue with the current Kabul authorities,
Si rivela indispensabile in questo momento, instaurare relazioni diplomatiche e quindi un
dialogo con le attuali autorita' di Kabul,

It must put an end to the inconsistencies and finally establish a clear
and independent foreign policy, at last shouldering its responsibilities, without
hesitation and avoiding inconsistencies.
Metta fine alle sue contraddizioni e
elabori finalmente una politica estera chiara, autonoma, si assuma
finalmente le sue responsabilita', senza tentennamenti e senza contraddizioni.

We must ask the Union to establish whether the proposals made by
these countries under the aegis of IGADD will be able to bring about a solution and
if so to give them our support.
Invitiamo l'Unione a verificare
se le proposte avanzate da questi Stati nell'ambito dell'IGAD siano tali da favorire
una soluzione e, in caso positivo, la sollecitiamo a dare il suo sostegno.

We need more specific signs and we need clearer evidence that the Belarus
Government does indeed want to establish a free and more democratic
society.
Ci servono segni piu' precisi, cosi' come deve essere precisa l'intenzione del
governo bielorusso di instaurare a tutti gli effetti un sistema libero
e democratico.

This illustrates a wide range of possible equivalents to establish: avviare, creare, elaborare, instaurare, realizzare, verificare. For the translator of an English text of this kind, it thus suggests a range of hypotheses which can be further investigated using a general or specialized TL corpus.

Not all expressions are paralleled by such a wide variety of equivalents. One of the most frequent lexical words in the Italian component of the corpus is relazione. The parallel English term is invariably report (unlike the British parliamentary paper). In contrast, under a third of the occurrences of another frequent word, favore, are paralleled by favour: parallel to votare a favore di we find vote for; parallel to accogliamo con favore, we welcome. The corpus suggests equivalents for technical terms, and a wider variety of possible translations for sub-technical lexis than are likely to be found in a bilingual dictionary, particularly at a phraseological level. It may also highlight syntactic contrasts, including differences in the organization of the text into sentences and paragraphs.

Using such a corpus can also have a positive impact on learning. Where a variety of parallel realizations are encountered, this may help learners to distinguish between different contexts of use, and reduce their tendency to think in terms of one-to-one equivalence, as Ulrych (1997) illustrates in respect of parallel English realizations to ossia. More general problems may also be faced: Danielsson and Mühlenbock (forthcoming) illustrate how a parallel corpus can cast light on translation strategies for proper names, showing whether these are transcribed, translated, clarified or simplified. Johns (forthcoming) proposes a number of types of exercises using parallel concordances, for instance by blanking out the search word in language A and asking learners to infer it from the parallel citations provided in language B.
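
An exercise of the blanking-out kind can be sketched in a few lines of Python. This is only a minimal illustration under assumed conditions: the aligned pairs are invented, language A is taken to be Italian and language B English, and the function name make_gap_items is hypothetical rather than drawn from any existing tool.

```python
import re

def make_gap_items(pairs, word, gap="_____"):
    """Blank out `word` in language A; keep language B as the clue."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    items = []
    for lang_a, lang_b in pairs:
        if pattern.search(lang_a):
            items.append({
                "prompt": pattern.sub(gap, lang_a),  # citation with the word removed
                "clue": lang_b,                      # parallel citation shown to the learner
                "answer": word,
            })
    return items

# Hypothetical aligned citations (Italian, English).
PAIRS = [
    ("Votiamo a favore della proposta.", "We vote for the proposal."),
    ("Accogliamo con favore questa relazione.", "We welcome this report."),
    ("Il comitato si e' riunito ieri.", "The committee met yesterday."),
]

for item in make_gap_items(PAIRS, "favore"):
    print(item["prompt"], "|", item["clue"])
```

Because the same gap is paired with different English realizations (vote for, welcome), even this toy set shows learners that the missing word has no single one-to-one equivalent.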

Since parallel concordances provide translations of each occurrence, citations are more likely to be immediately understandable for the user, diminishing the difficulties of retrieval and risks of misinterpretation associated with monolingual and comparable corpora. For the same reason, the scope for incidental learning may be increased. However, notwithstanding their apparent face validity, parallel corpora also introduce new dangers deriving from the assumption that parallel occurrences are effectively equivalent. It is necessary to ask whether the translations in the corpus are reliable and authoritative (note 6), and to bear in mind that the use of translations to identify equivalents inevitably implies "reduc[ing] the target language to a mirror image of the source language" (Teubert 1996: 250) - or the SL to a mirror image of the TL:

There is, for instance, no direct T[ranslation] E[quivalent] in English for the German
word Schadenfreude [...] Therefore, we will rarely find occurrences of
Schadenfreude in German translations of English texts. Generally
speaking, translations in language B will contain `grosso modo' only those lexical
items which count as TEs for items of the vocabulary of language A.
The same is true for syntax. The `impersonal passive' (e.g. Es wurde viel
getrunken, literally `It was drunk a lot') is a fairly common syntactic
construction in German for which there is no equivalent in English.
(Teubert 1996: 247)

Using translations as models for the TT thus risks reproducing those features of "translationese" which have been identified by workers using corpora in descriptive translation studies: normalization, simplification, explicitation (Baker 1993, 1998), "sanitization" by reducing connotational meanings (Kenny 1998), increased cohesion (Øverås 1998), and lower lexical density, higher mean sentence length, and higher proportions of high-frequency words (Laviosa 1998). Gellerstam (1996) shows how translations into Swedish of English texts carry over many features of English vocabulary, syntax, and rhetoric when compared with comparable Swedish originals; Gavioli and Zanettin (1997) illustrate some similar features in Italian translations from English. Using parallel corpora seems likely to reinforce such tendencies (though it is of course possible that they may increase learners' awareness of these features, and hence their conscious control of them: Ulrych 1997).

The unreliability of the translations in parallel corpora makes it advisable to use them in conjunction with monolingual or comparable corpora, so that, for instance, a translation hypothesis derived from a parallel corpus can be tested against a collection of original texts in the language in question. The ideal parallel corpus, from this point of view, will be bidirectional or reciprocal (cf 1 above), allowing the user to see whether occurrences found in translations into language B are also found in original texts in language B, and whether these are translated into language A in the manner encountered in original texts in language A. Such a corpus combines the advantages of a parallel corpus with those of a comparable one: from this point of view, bidirectional English-Italian corpora would seem an important area for future research and development. Such corpora are however considerably more difficult to design and compile than comparable ones, given the need to create comparable collections of texts which have been translated, and to align the texts and translations prior to use. Given the amount of work involved, they are likely to be relatively unspecialized in order to extend their range of application (see e.g. the English-Norwegian parallel corpus: Johansson and Hofland 1994). Consequently there is still likely to be a role for comparable and unidirectional parallel corpora of a more specialized nature. One form of the latter may be compiled by the specialized translator (or their client), drawing on the texts that s/he has (had) translated in the past (cf note 5 above).

It should be noted en passant that parallel concordancing software can also be used to analyse a single text and its translation. This is potentially a useful tool for translators to check and evaluate their own translations. Aligned versions of the ST and TT can be used to see whether a particular term in the ST has been translated consistently in the TT, or whether (given the tendency of translations to be less lexically varied than their source texts) a particular expression in the TT corresponds to a variety of expressions in the ST. Type/token ratios and lexical density measures for the ST and TT can also be compared, and evaluated by comparison with those found in comparable or parallel corpora of similar texts.
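
The two checks mentioned here, lexical variety and consistency of rendering, can be approximated with very little code. The Python sketch below is an assumption-laden illustration: the tokenizer is deliberately crude, the aligned segments are invented, and the function names type_token_ratio and renderings are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep runs of letters only (a crude tokenizer)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(text):
    """Number of distinct word forms divided by number of running words."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def renderings(aligned, source_term, candidates):
    """Count which candidate TT renderings occur where the ST has source_term."""
    counts = Counter()
    for src, tgt in aligned:
        if source_term in tokenize(src):
            tgt_tokens = tokenize(tgt)
            found = [c for c in candidates if c in tgt_tokens]
            counts.update(found or ["(none of the candidates)"])
    return counts

# Hypothetical aligned ST/TT segments.
ALIGNED = [
    ("The report was adopted.", "La relazione e' stata adottata."),
    ("This report is late.", "Questo rapporto e' in ritardo."),
    ("We reject the report.", "Respingiamo la relazione."),
]

print(renderings(ALIGNED, "report", ["relazione", "rapporto"]))
print(type_token_ratio("the cat saw the dog"))
```

Type/token ratios fall as texts get longer, so the comparison is only informative between texts of similar length or over fixed-size samples.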

3. Conclusions

There is as yet little hard empirical evidence to demonstrate the effectiveness of corpora as translation and as learning tools. Williams (1996) found a 40% improvement in the recovery of correct equivalents when "parallel" texts were used as translation aids as opposed to bilingual dictionaries, and one might expect these results to be matched or bettered with larger collections of texts in electronic format, and the aid of retrieval software. In a pilot experiment Bowker (1998) found that learners using a specialized corpus of texts in the target language (their L1) showed greater correct term choice and idiomaticity than a matched group using bilingual dictionaries alone. On the other hand, Bernardini and Aston (forthcoming) found that on two translation tasks into the L2, learners using monolingual L2 dictionaries performed better than matched groups using a general L2 monolingual corpus. While learners seem to a large extent enthusiastic about using corpora, it remains to be shown just in what respects, and under what conditions, their performance as translators may improve as a consequence: we cannot for instance exclude the idea that training with corpora may also improve dictionary usage, by instilling greater attention to collocation and register. No research that I am aware of has yet attempted to compare the effectiveness of different types of corpora, or of different learner approaches to them; yet more difficult to measure are the overall effects of corpus use on learning, be this in terms of general linguistic knowledge and ability, or as relating to a specialized text-type.

In this climate of empirical uncertainty, arguments for and against the use of corpora in translator training must be of a theoretical nature, and can resort at best to anecdotal evidence. Where available and accessible, appropriate corpora appear able to provide better and faster solutions to many of the translator's problems in a unified environment, with positive effects on learning. They make possible more idiomatic, native-like interpretations of source texts and a use of more idiomatic, native-like strategies in target texts. It is our experience at Forli' that few trainee-translators who have used corpora would wish to be without them, notwithstanding (or because of?) the investment in time and effort required to compile corpora and to learn how to use them, and we expect that as the number of available corpora and the quantity of suitable software increases, the use of corpora for translation and translator-training will gather further momentum, with a growth in its cost-effectiveness.


1. The Parole project aims to produce general comparable corpora for all the languages of the EU.

2. Parallel corpora can be extended to include multiple languages (Woolls 1997), or multiple translations of each text (Ulrych 1997, Malmkjaer 1998). As the value of such extensions seems more descriptive than pedagogic, I shall not discuss them here.

3. In the gave up sense, su is of course an adverb rather than a preposition. If the corpus used is tagged with part-of-speech codes (as is the case with the BNC and the Bank of English), it may be possible to avoid unwanted senses by searching for a specific part of speech, e.g. dare su=PRP (or an equivalent formalism). Part-of-speech tagging may also facilitate analysis, enabling the data to be sorted by part-of-speech code.

4. Bowker (1998) and Pearson (1996, 1998) argue that where specialised corpora are used to train translators in a specialised field, they should include a range of different types of text - expert, instructional, and popularised. The latter types, they argue, are likely to explain terms and concepts which are taken for granted in expert texts. However, it is important not to confuse these types in the corpus, since we would not, for example, expect divulgative texts to have the same collocational and colligational regularities as specialist ones, nor to contain the same range of terms as the latter. Where the corpus is used to translate a specific text, the appropriate component should be given priority.

5. King (1997: 396) compares the number of types in translations of Le petit prince with the French original: scoring the latter as 100, figures for English and for Italian are 83 and 107 respectively.

6. This may, for instance, be dubious if all the translations in the corpus have been produced by the same translator, as is often the case with "translation memory" systems.


Aijmer, K., B. Altenberg and M. Johansson (eds.), 1996, Languages in contrast, Lund University Press, Lund.
Aston, G., 1996, "Traduzione e tecnologia", in G. Cortese (a cura di), Tradurre i linguaggi settoriali, Edizioni Cortina, Torino, pp. 293-310.
Aston, G., 1997, "Small and large corpora in language learning", in Lewandowska-Tomaszczyk and Melia, pp. 51-62.
Aston, G. (ed.), forthcoming, Learning with corpora.
Baker, M., 1993, "Corpus linguistics and translation studies: implications and applications", in Baker et al., pp. 233-250.
Baker, M., 1998, "Réexplorer la langue de la traduction: une approche par corpus", Meta, 43/4, pp. 480-485.
Baker, M., G. Francis and E. Tognini-Bonelli (eds.), 1993, Text and technology: in honour of John Sinclair, Benjamins, Amsterdam.
Barlow, M., 1995, "ParaConc: a concordancer for parallel texts", Computers & texts, 10.
Bernardini, S., 1997, "A `trainee' translator's perspective on corpora", available online,
Bernardini, S., in press, Competence, capacity, corpora, CLUEB, Bologna.
Bernardini, S. and G. Aston, forthcoming, "Do corpora actually help translators?".
Bertaccini, F. and G. Aston, forthcoming, "Exploring cultural connotations through adhoc corpora", in Aston (forthcoming).
Biber, D., 1993, "Representativeness in corpus design", Literary and linguistic computing, 8/4, pp. 243-257.
Bowker, L., 1998, "Using specialized monolingual native-language corpora as a translation resource: a pilot study", Meta, 43/4, pp. 631-651.
Burnard, L. and T. McEnery (eds.), forthcoming, Papers from TALC 98 (provisional title), Peter Lang, Bern.
Chatwin, B., 1989a, Utz, Pan, London.
Chatwin, B., 1989b, Utz, trans. D. Mazzone, Adelphi, Milano.
Danielsson, P., and K. Mühlenbock, forthcoming, "Retrieval of name translations in parallel corpora", in Burnard and McEnery.
Ferri, S., 1999, Uso di piccoli corpora comparabili per la traduzione medica, unpublished dissertation, SSLMIT, Forli'.
Friedbichler, I. and M. Friedbichler, 1997, "The potential of domain-specific target-language corpora for the translator's workbench", available online,
Gavioli, L., forthcoming, "Corpora and the concordancer in learning ESP: an experiment in a course of interpreters and translators", in G. Azzaro and M. Ulrych (eds.), Anglistica e ....: metodi e percorsi comparatistici nelle lingue, culture e letterature di origine europea. Volume II: Transiti linguistici e culturali, EUT, Trieste.
Gavioli, L. and F. Zanettin, 1997, "Comparable corpora and translation: a pedagogic perspective", available online,
Gellerstam, M., 1996, "Translations as a source for cross-linguistic studies", in Aijmer et al., pp. 53-62.
Hartmann, R.R.K., 1980, Contrastive textology: comparative discourse analysis in applied linguistics, Julius Gross Verlag, Heidelberg.
Hulstijn, J.H., 1992, "Retention of inferred and given word meanings: experiments in incidental vocabulary learning", in P.J.L. Arnaud and H. Béjoint (eds.), Vocabulary and applied linguistics, Macmillan, London, pp. 113-125.
Johansson, S. and J. Ebeling, 1996, "Exploring the English-Norwegian parallel corpus", in C. Percy, C.F. Meyer and I. Lancashire (eds.), Synchronic corpus linguistics, Rodopi, Amsterdam, pp. 3-15.
Johansson, S. and K. Hofland, 1994, "Towards an English-Norwegian parallel corpus", in U. Fries, G. Tottie and P. Schneider (eds.), Creating and using English language corpora, Rodopi, Amsterdam, pp. 25-37.
Johns, T., 1991, "Should you be persuaded: two examples of data-driven learning", in T. Johns and P. King (eds.), Classroom concordancing, (ELR journal, 4), Centre for English language studies, Birmingham, pp. 1-16.
Johns, T., forthcoming, "Reciprocal learning: a practical application of parallel concordancing".
Joseph, J.E., 1998, "Why isn't translation impossible?", in S. Hunston (ed.), Language at work, BAAL/Multilingual Matters, Clevedon, pp. 98-108.
Kenny, D., 1998, "Creatures of habit? What translators usually do with words", Meta, 43/4, pp. 515-523.
King, P., 1997, "Parallel corpora for translator training", in Lewandowska-Tomaszczyk and Melia, pp. 393-402.
Lauridsen, K., 1996, "Text corpora and contrastive linguistics: which type of corpus for which type of analysis?", in Aijmer et al., pp. 63-71.
Laviosa, S., 1998, "Core patterns of lexical use in a comparable corpus of English narrative prose", Meta, 43/4, pp. 557-570.
Lewandowska-Tomaszczyk, B. and P.J. Melia (eds.), 1997, PALC'97: practical applications in language corpora, Lodz University Press, Lodz.
Louw, B., 1993, "Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies", in Baker et al., pp. 157-176.
Maia, B., 1997, "Do-it-yourself corpora... with a little bit of help from your friends!", in Lewandowska-Tomaszczyk and Melia, pp. 403-410.
Malmkjaer, K., 1998, "Love thy neighbour: will parallel corpora endear linguists to translators?", Meta, 43/4, pp. 534-541.
Nida, E., 1964, Towards a science of translating: with special reference to principles and procedures in Bible translating, E.J. Brill, Leiden.
Øverås, L., 1998, "In search of the third code: an investigation of norms in literary translation", Meta, 43/4, pp. 571-588.
Pearson, J., 1996, "Teaching terminology using electronic resources", in S. Botley, J. Glass, T. McEnery and A. Wilson (eds.), Proceedings of Teaching and language corpora 1996, UCREL, Lancaster, pp. 203-216.
Pearson, J., 1998, Terms in context, Benjamins, Amsterdam.
Pearson, J., forthcoming, "Surfing the internet: teaching students to choose their texts wisely", in Burnard and McEnery.
Reiss, K., 1981, "Type, kind and individuality of text: decision making in translation", Poetics today, 2/4, pp. 121-131.
Scott, M., 1997, Wordsmith Tools (ver. 2.0), Oxford University Press, Oxford.
Sinclair, J.M., 1991, Corpus, concordance, collocation, Oxford University Press, Oxford.
Teubert, W., 1996, "Comparable or parallel corpora?", International journal of lexicography, 9/3, pp. 238-264.
Ulrych, M., 1997, "The impact of multilingual parallel concordancing on translation", in Lewandowska-Tomaszczyk and Melia, pp. 421-435.
Varantola, K., 1997, "Translators, dictionaries and text corpora", available online,
Williams, I.A., 1996, "A translator's reference needs: dictionaries or parallel texts", Target, 8, pp. 277-299.
Woolls, D., 1997, Multiconcord (ver. 1.0), CFL Software Development, Birmingham.
Zanettin, F., 1998, "Bilingual comparable corpora and the training of translators", Meta, 43/4, pp. 616-630.
Zanettin, F., forthcoming, "Swimming in words: corpora, translation and language learning", in Aston (forthcoming).

Natalie Kübler [UFR EILA]

Teaching English Verbs With Bilingual Corpora: Examples in the Computer Science
Natalie Kübler 1
Pierre-Yves Foucou2
In French universities, most computer science syllabuses include compulsory teaching in English. However, English teachers
are not necessarily experts in computing, and textbooks or dictionaries are not complete, and rapidly become obsolete,
especially with regard to verbs. Yet it is precisely the English verb system which French-speakers have trouble mastering,
particularly in technical areas.
We shall describe how using various types of corpora, such as technical English corpora, aligned English to French corpora,
and « general » English corpora, has allowed us to achieve two objectives : the discovery and description of the authentic
use of technical verbs; and the preparation of teaching material. The resultant description will firstly help us to identify more
appropriate pedagogic objectives for teaching a specialist’s language ; it will then serve in a Web-based language teaching
environment to generate learning activities.
0. Introduction
In French universities, English classes are very often included within specialised training, because English is
nowadays the most widely used language in the technical and scientific world. English is particularly necessary in the
computer science (CS) area because of the impressive and quick expansion of the domain. At the linguistic level,
this translates into greater productivity in the coining of new terms or new uses of already existing terms.
The technical documentation and terminology of most software packages or operating systems are first
written in English. Translating the documentation into other languages raises the issue of
double competence : users must have both linguistic and technical knowledge. This problem is becoming more acute in
the teaching of English as a second language.
Observation of real language usage can invalidate conventional, and over-simplifying hypotheses. Let us
consider the simple example of navigating on the Internet : different terms are used in the various browsers to
describe the same function of memorizing addresses (URLs4) : the French notion of signet matches bookmarks,
hotlist and favorites, in « Netscape » and « Internet Explorer » respectively. Students can easily acquire these
uses, but closely related uses can present some difficulties :
(1) You should bookmark this page now !
(2) *You should favorite this page
(3) Bookmark this page in your favorites !
Furthermore, different translators will not always agree on the translation of a term or an expression : one
French translation for bookmark is marque-page, but the following has also been found:
(4) Bookmarquez cette page !5
To allow students to find their way through this ever-changing jargon, it is necessary to teach CS English in a
contrastive way by using authentic documents. This permits computer scientists – whatever their technical
competence – to feel at ease in English, as well as in French. French translations lead beginners in computer
science to a better understanding of technical documentation. More advanced computer scientists should be able
to deal with the French terms, whilst they are already used to working with the English terms. That is why
translators often give the English term at the beginning of a translated document, and subsequently use the
French equivalent throughout. Thus, terms like chipset for ensemble de composants, spool for queue or file
d’attente, or even spreadsheet for tableur, can usefully be given at the beginning of a French document because
they are already known to French-speakers. In the present article, we describe the pedagogical experiences that
took place at the Technology Institute of Villetaneuse at the University of Paris 13. We shall develop one of the
most problematic issues for French-speaking learners : mastering CS English verbs. This point is particularly
crucial, all the more so since it has often been overlooked in textbooks or specialised dictionaries.
1 Université de Paris 7 :
2 Université de Paris 13 :
3 We would like to thank A. J. Renouf for her very helpful comments on an earlier version of this article.
4 URL : Uniform Resource Locator : from the Free On-Line Dictionary of Computing
5 We found around 100 occurrences of this form on Altavista.
We shall show how available corpora on the Internet can be used to present the students with varied examples, in
contexts that are simple, yet encompass all possible structures. The contrastive analysis of bi- or multi-lingual
technical documentation can lead to support a description of the same uses in different languages. Using
authentic and constantly updated documents introduces a reality component in the description of usage : we aim
at describing the verbs that are actually used by a scientific community, rather than the description of terms that
have been standardized by an official body. We use the conventional corpus query tools that have been
developed at the Laboratoire de Linguistique Informatique of the University of Paris 13. These tools have been
adapted to the specific needs of language teaching : simple and bilingual concordances, the automated creation
of learning activities, and so on.
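
To give an idea of what a simple concordance involves, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not the tool developed at the Laboratoire de Linguistique Informatique; the function name kwic, the fixed context width and the sample text are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword aligned in a fixed column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so that all keywords line up.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Hypothetical sample built on the bookmark examples.
SAMPLE = ("You should bookmark this page now. "
          "Bookmark this page in your favorites. "
          "The bookmark was deleted.")

for line in kwic(SAMPLE, "bookmark"):
    print(line)
```

Lining the hits up in one column lets students compare verbal and nominal uses of the same form at a glance.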
1. Verbs and Corpora
A pedagogical choice
Confronting French-speakers with CS English can cause them some problems in comprehension and production.
Very few verbs are presented in technical dictionary entries; they are often introduced at the end of a noun
entry, without any other information than the part-of-speech (POS) category. It is however these that pose the
main problems. Once non-native speakers have acquired a technical term, be it a simple noun, a multi-word noun, or
an adjective, they seldom have further problems with it. The more they progress in computer science, the fewer problems
this type of term poses, because they have acquired the specific terms of their subject area. The difficulties
that are encountered, be they on the level of comprehension or production, relate primarily to the verbs, as we
noted among French-speaking students, whether they be beginners in English or more advanced.
In our project, we are currently developing a description of the English CS verbs and their equivalents in French.
We have divided the verbs into three different categories, which are quite similar to pragmatic approaches to the
definition of terms. Hoffman (1985) suggests that there are three categories of terms in a specialised vocabulary :
subject-specific vocabulary, non subject-specific vocabulary and general vocabulary. For Trimble and Trimble
(1978), there are highly technical terms, a bank of technical terms, and sub-technical terms. While the first two
categories are the same as the first two described by Hoffman, the last one covers the terms coming from the
general language, but that have taken on a specific meaning in specialised subject areas.
As our aim is slightly different from describing terms for native speakers, we chose an approach which takes into
account the point of view of non native speakers, i.e. a pedagogical point of view. Examining the verbs, we
noticed that the highly technical verbs (according to Hoffman’s first category) are very often neologisms6 which
have to be acquired as such. The second category of verbs partly matches the first and second categories of
Hoffman and Trimble and Trimble, since it consists of verbs that already exist in general English, but that have
acquired a specialised use. The last group corresponds to both the third category of Hoffman and of Trimble and
Trimble : it consists of general English verbs that are used in CS English, particularly those that are extremely
frequent and that are difficult to master for French-speakers in this subject area.
Our approach has potential for the creation of pedagogical material allowing teachers to present students directly
with authentic data, as well as to automatically generate learning activities, such as drills. We do
indeed have a Web-assisted language learning (WALL) environment (Foucou & Kübler 2000), which generates
learning activities allowing students to practice acquired knowledge.
1.1. Existing pedagogical material (dictionaries/textbooks ; online/offline)
A great number of textbooks offer descriptions of the specific characteristics of CS English, but these often
remain basic. The verb/noun ambiguity, which is typical in technical English, and the great versatility in the
creation of new terms are rarely mentioned. Very few indications are given about the sentence, i.e., the verb
structures and their distributional and transformational properties. As far as translations are concerned, CS
English verbs and their equivalents in French are frequently described as lists that are unfortunately not always
complete, and do not contain information about the different contexts of use, leaving the user to guess which
translation must be used in which context.
6 This is not surprising as computer science is producing new concepts almost every day, especially with the
development of the Internet.
General dictionaries are generally sparing in their inclusion of CS terms (which is not their primary function, as
they are not specialised dictionaries), and specialised dictionaries are often incomplete (for non native speakers)
or become very quickly obsolete. The information provided by these two types of dictionaries is not very useful,
given the real nature of texts. This explains why it is necessary to resort to more current reference sources. We
agree with Pearson (1998), for whom the context is the only way of making the difference between a term and a
word. This means here that we shall use corpora to decide whether a verb should be described or not.
CS dictionaries focus on nouns and their meanings, as well as their possible translations in French (in bilingual
glossaries). Beginners and French-speaking students in computer science (such as French university students in
the first two years) will find definitions, which are sometimes encyclopedic, in FOLDOC (Free On-Line
Dictionary Of Computing7) or in other CS dictionaries. Students are faced with the same type of explanations
and French translations of the terms in the various bilingual dictionaries that can be found on the Web8.
· Numerous specialised acronyms are found in dictionary entries. Three types of acronyms can be found in
bilingual dictionaries :
- Acronyms that are translated into French, such as ISDN (Integrated Services Digital Network), translated
as RNIS (Réseau Numérique à Intégration de Services).
- Acronyms of which only the expansion is used in French, such as OS (Operating System), which is
translated by système d’exploitation, but for which the French acronym SE is very rarely used except
among purists.
- Finally, acronyms that do not have a translation in French, such as SCSI (Small Computer System
Interface), or MS-DOS (Microsoft Disk Operating System).
· Dictionaries also contain some very specialised modifiers, such as controller-less, big or little endian.
1.2. Difficulties of French-Speakers
We have noticed among French-speaking learners several types of difficulties which are related to the verb
system in English.
· Verb/noun ambiguity (nominal use of verbs and vice-versa) : It can be difficult for students to distinguish a
verb from a noun ; for a native speaker of English the context alone is enough to tell them apart, which
is not the case for a non-native speaker. This is all the more difficult since French-speakers often do not
know how easily and frequently verbs can be created from nouns (such as to zip from zip, a program used
to compress data) or nouns from verbs (such as a login, based on the verb to log in). Moreover, some English
verbs have no direct equivalent in French, and are translated by paraphrases, or by support verbs and their
predicate nouns (collocations).
· Polysemy : Some extremely polysemous English verbs can pose comprehension or structural problems for
French-speakers. To run is a good example ; on the one hand, its various uses have different translations in
French, and on the other hand, some of its structures are determined by the possible arguments of the verb.
· Structural differences between French and English : Structural differences between very similar verbs in the
two languages are often the cause of interference errors for French-speakers (Kübler 1995). This is also the
case in CS English.
The teaching of CS English cannot be achieved without a description of verbs and their structures.
Unfortunately, it is exactly this type of description that is missing in textbooks. It can however be extracted from
corpora. A thorough description of CS verbs appears to be necessary, not only for teaching, but also for other
applications, such as automatic error correction or automated translation systems.
7 See footnote 4.
2. Identifying problematic verbs
2.1. Specialised, general, and parallel corpora
The rapid development of the World Wide Web opens up access to ever-expanding resources in terms of corpora.
Using technical documentation which is exclusively related to the real world has the advantage of introducing an
authentic component ; its importance has been highlighted for years in the literature on this subject (T. Johns
1988). In order to describe the reality of CS English, we chose as a working corpus the Linux HOWTOs (half a
million words). The HOWTOs are easily accessible and regularly updated technical documentation that
has the advantage of being multilingual. They have been translated into several languages, including
French.
In order to be thorough, we sampled other corpora. Texts relating to computer science offer a wide variety of
styles and levels of language. We chose to use a representative sample of different possible styles. Our corpora
have been extracted from the almost inexhaustible resources offered by the World Wide Web, and divided into
five categories :
i) Technical Documentation
- user’s manual of the UNIX operating system (250 documents, 16 MB, 53300 types)
- the Internet RFCs, which are the instructions for use of the Internet (2000 files, 85 MB, 161083 types)
ii) Specialised On-line Press
Wired : computer science magazine (1000 articles, 5MB, 38392 types)
iii) Newsgroups
Newsgroups deal with various aspects of computing ; the level of language is casual, at times
extremely so, as shown in the following example, which has been extracted from
the comp.lang.perl.misc newsgroup :
(5) You should either use double quotes or joins, but not both :
Either : $file = '../dir/dir/dir/' . $country . '_' . $machine ;
Or, preferably (at least to me) :
$file = "../dir/dir/dir/$country_$machine" ;
should be :
$file = "../dir/dir/dir/${country}_$machine" ;
Our newsgroup corpus contains, for the time being, approximately a thousand articles (ca. 6500 types).
iv) FAQs (Frequently Asked Questions)
FAQs are often related to a newsgroup and consist of files that contain the most frequently asked
questions on a given subject. For example, FAQs are available on the following subjects : the Y2K bug,
Solaris OS, or even Windows.
v) « General » English
To put the results in perspective and examine them from different angles, we use « general English » corpora,
such as The Times (3'500'000 words), or The Herald Tribune (1'500'000 words). The other CS English
corpora allow us to check specialised uses, whereas the « general English » corpora are used to verify the degree of
specialisation of the selected verbs.
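One simple way to approximate the « degree of specialisation » of a verb is to compare its normalised frequency in a specialised corpus with its normalised frequency in a general corpus. The following is a minimal sketch under that assumption ; the function names and the counts are invented for illustration and do not come from the corpora described here.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def specialisation_ratio(count_cs, size_cs, count_gen, size_gen):
    """Ratio of normalised frequencies.

    Values above 1 mean the form is relatively more frequent in the
    specialised corpus than in the general one.
    """
    general = per_million(count_gen, size_gen)
    if general == 0:
        return float("inf")  # the form never occurs in the general corpus
    return per_million(count_cs, size_cs) / general


# Invented counts, for illustration only.
ratio = specialisation_ratio(count_cs=700, size_cs=500_000,
                             count_gen=70, size_gen=3_500_000)
print(round(ratio, 1))
```

A ratio of this kind only compares word forms. It cannot tell a specialised use from a general one, which is why the concordances still have to be read.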
2.2. Frequencies
A first sampling of our corpus permitted us to obtain a list of the most frequent verbs. In the highest frequencies
of the HOWTOs, the first three verbs (once auxiliaries and modals were discarded) are the following :
use (3114 occurrences) : using 1726 ; used 1192 ; use 196 (partly nominal)
run (1565 occurrences) : run 886 ; running 523 ; runs 140 (i.e. a very low percentage of nouns)
install (1163 occurrences) : install 662 ; installed 369 ; installing 132
The number of occurrences very quickly drops to a few hundred (to boot has around 500 verbal occurrences),
or even fewer than a hundred (to download has around 40).
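A frequency list of this kind, in which inflected forms are grouped under one verb, can be sketched as follows. This is only an illustration : the form-to-lemma table is a hypothetical stand-in for a real tagging dictionary, and the sample sentence is invented.

```python
import re
from collections import Counter

# Hypothetical form-to-lemma table; a real one would come from a tagging dictionary.
LEMMAS = {
    "use": "use", "uses": "use", "used": "use", "using": "use",
    "run": "run", "runs": "run", "ran": "run", "running": "run",
    "install": "install", "installs": "install",
    "installed": "install", "installing": "install",
}


def lemma_frequencies(text):
    """Count the word forms listed in LEMMAS and total them per lemma."""
    words = re.findall(r"[a-z]+", text.lower())
    forms = Counter(w for w in words if w in LEMMAS)
    totals = Counter()
    for form, n in forms.items():
        totals[LEMMAS[form]] += n
    return totals, forms


# Invented sample text.
sample = ("Run the script after installing it. It runs once installed; "
          "use it when running tests.")
totals, forms = lemma_frequencies(sample)
print(totals.most_common())
```

Counting forms in this way does not separate nominal from verbal uses (a run, an install), so the totals would still need to be checked against the concordances.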
These results can be compared with the frequencies in The Times where the three most frequent verbs are use,
run (general English uses, and not CS English), and call :
use (30324 occurrences) : used 13333 ; use 11363 ; using 4333 ; uses 1295
run (26697 occurrences) : run 12773 ; runs 4517 ; running 6541 ; ran 2866
call (13771 occurrences) : called 12445 ; call 5922 ; calls 3601 ; calling 1793
The frequencies in the French corpus, i.e. the French translations of the HOWTOs, are surprisingly different. The
most frequent verb is utiliser with more than 2000 occurrences ; the next verb fonctionner plummets to around
300 occurrences, and the rest are even rarer. This shows that the French translation of a verb differs depending
on its use ; among the various uses of to run, one is translated by fonctionner, which is also the translation of
to work. Using several verbs to translate a single term reduces the frequencies of the French verbs.
For this reason, describing and teaching only the most frequent verbs is not satisfactory. Among the less frequent
verbs in the reference corpus are verbs that must be taught because they are especially difficult for French-speakers.
Our concordancer allows us to query the corpus on character strings or with Perl-like regular expressions
containing syntactic categories such as nouns, verbs, adjectives, etc. As shown in Figure 1, the Perl-like regular
expression (have|has) \w+ed looks for sequences of two words : either have or has followed by a word ending
in –ed. This search string retrieves occurrences of present perfect verb forms :
Figure 1 : Present perfect occurrences
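The kind of query described above can be approximated with a small keyword-in-context routine. This is a minimal sketch and not the concordancer used in the project : the function name, the context width and the sample text are invented, and the pattern adds word boundaries and flexible whitespace to the expression quoted in the text.

```python
import re


def concordance(text, pattern, width=30):
    """Return one keyword-in-context line per match of `pattern` in `text`."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the matches line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


# have or has followed by a word ending in -ed.
PRESENT_PERFECT = r"\b(?:have|has)\s+\w+ed\b"

# Invented sample text.
sample = ("The HOWTOs have changed a lot. LILO has booted the kernel. "
          "You have to reboot.")
hits = concordance(sample, PRESENT_PERFECT)
for line in hits:
    print(line)
```

A purely string-based pattern like this one misses irregular participles (has run) and also returns any other word in –ed that happens to follow have or has, which is one reason for combining regular expressions with syntactic categories.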
A first query searching for all the terms that can be considered as verbs provided us with a more precise list than
just the frequency list. This query is important because of the great differences existing between French and
English. We picked out verbs like to mirror or to cache which are not frequent in the corpus (fewer than a hundred
occurrences each), but which can cause difficulties, since there are no verbal equivalents in French. *Miroirer9
or cacher are not good candidates.
A second type of query dealt with the context in which each verb can be found individually, in order to extract
their distributional and transformational properties. These sample concordances were also edited for
presentation to the students. The aim was to make the students aware of the verbs' behaviour
through contact with authentic data. Data-driven approaches to language teaching often recommend comparing
the examples extracted from a corpus with the descriptions that can be found in reference books (B. Dodd 1997
in Wichman et al. for example). This is not possible with CS English as there are no descriptions of
CS English verbs. The comparison with general English uses, however, can help to isolate specialised
English verbs.
Our reference corpus, the Linux HOWTOs, has been translated into different languages, so our English corpus can be
aligned with its French translation. The French and English corpora were aligned, paragraph by paragraph, by a
Perl script developed with, and included in, our series of tools (the WALL environment). Since the alignment is
not always perfect (translators can decide to add or delete sections), the corresponding paragraph sometimes has to be
searched for manually. Our tool allows the user to query either one of the corpora (cf. Figure 1) and then to
search, for each occurrence of a verb, for its equivalent in the other language (cf. Figure 2).
9 Words preceded by an asterisk do not exist in the given language.
Figure 2 : Aligned paragraphs for « announce »
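The paragraph-by-paragraph alignment and the bilingual lookup can be illustrated with a small sketch. It assumes the simplest possible strategy, pairing paragraphs by their position, which is not necessarily what the Perl script mentioned above does ; the function names are invented, and the sample paragraphs reuse sentence pairs quoted elsewhere in this paper.

```python
import re


def split_paragraphs(text):
    """Split a document into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def align_paragraphs(english, french):
    """Pair paragraphs by position; unmatched trailing paragraphs are dropped."""
    return list(zip(split_paragraphs(english), split_paragraphs(french)))


def find_equivalents(pairs, verb):
    """Return the aligned pairs whose English side contains a form of `verb`."""
    pattern = re.compile(rf"\b{re.escape(verb)}\w*", re.IGNORECASE)
    return [(en, fr) for en, fr in pairs if pattern.search(en)]


english = """LILO is a program that will allow you to boot Linux

You can zip the file and attach it to your message

The system doesn't boot at all"""

french = """LILO est un programme vous permettant de lancer Linux

Vous pouvez zipper le fichier et le joindre à votre message

Le système ne boote plus du tout"""

pairs = align_paragraphs(english, french)
for en, fr in find_equivalents(pairs, "boot"):
    print(en, "|", fr)
```

Positional pairing breaks down as soon as a translator adds or deletes a paragraph, which is why a manual search of the neighbouring paragraphs remains necessary in such cases.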
After examining concordances to discriminate between the different uses of each verb, we looked for the
possible French translations of each use in the French translated corpus. Our aim was, on the one hand, to
refine the description of English verbs, and on the other hand to match the different French equivalents.
We did the same with the French corpus : analysis of the different uses, and searching for the English
equivalents.
3. Illustration with a few verbs
We show here how querying corpora can reveal the diversity and variety of uses of verbs in CS English.
Working on corpora allowed us to describe three types of verbs that are typical of CS English : neologisms,
specialised uses of verbs that already exist in « general English », and « general English » verbs that are
extremely frequent in CS English.
The results of corpus query can also reveal the potential difficulties that French-speakers have. Comparing CS
English verbs with their French equivalents, but also with verbs in « general English » allowed us to highlight
differences, especially in the first two types of verbs. The difficulties that French-speakers can have with verbs
of the third type – general English – are shared by French-speakers in general. Basically, we postulate that
two main factors are responsible for the errors that French-speakers make in English : interference from the
mother tongue and overgeneralization of rules in the second language (Kübler 1995).
3.1. Verb/noun ambiguity in the neologisms of CS English
Neither the frequency list nor the list of terms tagged as verbs is enough to cover all the verbal neologisms that
are created from technical or proper nouns. The terms we are looking for are not necessarily tagged as verbs in
our working dictionary10. Reference books are of little benefit either. In textbooks for teaching CS English, these
verbs are never clearly explained. English dictionaries of computing or bilingual glossaries of computing (be
they hard-copy dictionaries or on-line glossaries that can be found on the Web) contain many nouns, but do not
mention verbal uses. For example, although the Dictionary of Computing (published by Oxford University Press
and aimed at learners of English as a second language) is very complete, it does not offer any information of this
kind.
10 We use a specific dictionary to tag our corpus, i.e. a list of words with part-of-speech categories. Information
extracted from our corpora allowed us to complete our dictionary.
With this in mind, we looked for inflected forms of verbs, i.e. words ending in –ed, –ing, and –(e)s. This type of
verb is regular because the simple past and past participle are built by simply adding –ed to the root. An even
finer selection can be made by searching the concordances for more complex verb forms, such as have, been, or
being followed by a word ending in –ed for instance. Verbs such as to ftp, to rlogin, to telnet, to gzip, and to
Mosaic were extracted in this way. The verb to zip is derived from the noun zip, hence the inflected forms zips,
zipping, zipped. In this case, the relationship between verb and noun is clear, as is the syntactic structure of the
verb :
(6) You can zip the file and attach it to your message
The term in use in French is as simple as in English :
(7) Vous pouvez zipper le fichier et le joindre à votre message
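The search for inflected forms described above can be sketched as follows. This is an illustration under several assumptions : the set of known verb forms is a hypothetical stand-in for the tagging dictionary, the list of auxiliaries is extended with has and be, and only the first sample sentence is taken (shortened) from the corpus, the other two being invented.

```python
import re

# Hypothetical stand-in for the tagging dictionary: forms already known as verbs.
KNOWN_VERB_FORMS = {"obtained", "installed", "used", "compressed"}

# An auxiliary followed by a word ending in -ed.
AUX_PLUS_ED = re.compile(r"\b(?:have|has|been|being|be)\s+(\w+ed)\b",
                         re.IGNORECASE)


def candidate_neologisms(text):
    """List -ed forms found after an auxiliary that the dictionary does not know."""
    found = []
    for m in AUX_PLUS_ED.finditer(text):
        form = m.group(1)
        if form.lower() not in KNOWN_VERB_FORMS and form not in found:
            found.append(form)
    return found


sample = ("The latest source can be FTPed from the directory. "
          "The file has been gzipped. It can be obtained by anonymous FTP.")
print(candidate_neologisms(sample))
```

The candidates returned by such a query are only a starting point ; each one still has to be checked in its concordance before it is added to the dictionary as a verb.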
For other verbs, the relationship can be more opaque ; to FTP is derived from the acronym FTP (File Transfer
Protocol), to Mosaic stems from Mosaic which is the name of the first browser of the World Wide Web :
(8) The latest source can be FTPed from the directory ftp…or Mosaiced from http
In this case, the English context alone is not enough to establish the basic syntactic structure of the verbs. Their
meaning remains unclear to a layman. French-speakers can have comprehension problems and may even
misinterpret the sentence. The possibility we have of verifying the French equivalent in exactly the same context
is therefore extremely useful. The French translation of the above example is :
(9) On peut charger la dernière version sur ftp … et sous Mosaic depuis http …
In French, the creation of neologisms, such as *ftper for example, is subject to more constraints than in
English11. French translators of such technical texts often have recourse to paraphrase based on the noun from
which the English verb has been derived. Describing structures in French and in English for the two verbs to ftp
and to Mosaic for example, means describing very different structures. French uses charger une version sur ftp
(on ftp), but sous Mosaic (under Mosaic). However, examining all the occurrences of to FTP in the corpus
suggested other possible translations :
(10) a. You can ftp it from
b. Vous pouvez l’obtenir par
Working on bilingual corpora highlighted this diversity and showed that an English technical verb often has no
stable translation in French ; that is why it is necessary to collect all possible equivalents. As we were checking
the English equivalents of the French expressions, we also found an English paraphrase around the noun FTP :
(11) a. It can be obtained by anonymous FTP from
b. On peut l’obtenir en faisant un FTP anonyme à partir de …
When the rules of euphony allow it, some creations coexist with periphrastic equivalents :
(12) a. They must telnet to the firewall
b. Il faut se connecter au firewall par le réseau
(13) a. Only the administrator can telnet directly to the firewall via Port 24
b. ?Seul l’administrateur peut télnéter directement le firewall sur le port 24
11 Here *ftper probably does not exist for reasons of euphony.
The first translation is an explanation of the telnet process ; the second one is quite surprising since from
a prepositional verb in English (Nhum telnet to Nmachine12) a transitive verb (which is a loan translation) is
created.
As there is only one occurrence of the French verb télnéter in the corpus, the acceptability of sentence (13b) is
questionable, although all the rules concerning the coining of new words have been respected. In this case,
combining frequency and structure can be useful to define the scope of the vocabulary to be taught : a structure
which is both rare and doubtful should be discarded.
One of the major problems concerning verbs in computer science is the lack of regularity in translating them
from English into French, and the divergences between norm and usage : terms standardized by an official body,
such as the Commission Ministérielle de Terminologie Informatique, are not always used, while deprecated
terms can be knowingly used because they are the ones that are used by the whole CS community13.
The verb to boot, which is quite frequent (700 token occurrences), illustrates this issue. Here again, reference
books, such as dictionaries or textbooks, are of little help. The on-line Merriam-Webster’s14 does not give any
definition of to boot related to computer science : the given meanings are to avail, to profit. There is no verb
entry for to boot in the Collins-Cobuild. Among the on-line dictionaries that are available on the Web, Wordnet15
is a little more complete because it gives a definition for the specialised use of this verb in computer science
(the second definition below). However, the information concerning the arguments of the verb or its syntactic
structure is insufficient :
(14) Boot : kick ; give a boot to
(15) boot : cause to load (an operating system) and start the initial processes
Another on-line dictionary that is specialised in computer science (FOLDOC) tells us that to boot comes from to
pull oneself up by one’s own bootstraps ; the original meaning of this expression (« to do something without
help ») has been transferred to a verb to bootstrap :
(16) Bootstrap : (From « to pull oneself up by one’s bootstrap »)
To load and initialise the operating system on a computer.
Normally abbreviated to « boot »
The original verb to bootstrap is no longer used very often in CS English, according to our corpus evidence ; the
corpus contains only thirteen tokens, of which only two are verbal uses :
(17) a. This is useful to bootstrap Linux on a system with only one floppy drive
b. Ceci est utile pour démarrer Linux sur une machine qui ne possède qu’un lecteur de disquettes
In France, the translation standardized by the Commission de Terminologie Informatique of the Ministry of
Culture is amorce for the noun, and amorcer for the verb ; these are specialised uses of already existing terms
that roughly mean « start ». However, although the noun amorce can be found in our French corpus, the verb amorcer
occurs very rarely. Looking for the French equivalents of the verb to boot in the French corpus reveals
démarrer, lancer, and less often the anglicism booter :
(18) a. You can specify various hardware parameters before booting the Linux kernel.
b. Vous pouvez préciser différents paramètres matériels avant de démarrer le noyau
(19) a. The system doesn’t boot at all
b. Le système ne boote plus du tout
(20) a. LILO is a program that will allow you to boot Linux
b. LILO est un programme vous permettant de lancer Linux
12 We use here the notation used in the theoretical and methodological frame of the lexicon-grammar, in which
for example Nhum represents a human noun, i.e. all the nouns that can be considered as humans (girl, driver,
linguist, guy, etc). M. Gross, 1975 : Méthodes en Syntaxe, Klinsieck : Paris.
13 This is particularly true in the GNU initiative and Linux community.
Doing the job the other way round, i.e. analysing the English equivalents of démarrer, and lancer, not only
allowed us to confirm to boot, but also to discover to run, to launch, to type, and to issue for the French lancer.
Using English verbs can thus rapidly become quite complex for a French-speaker. Comparing English and
French verb concordances should allow students to find out in which context these verbs can be used.
The French booter and amorcer are unequivocally translated by to boot ; booter being a nonce borrowing, and
amorcer a new use of the verb which has been especially created to give an official French equivalent to the
English to boot. Analysing the concordances reveals precise indications of when to use the translations démarrer
and lancer (or se lancer in some cases). Generally and with very few exceptions, to boot is used for démarrer
and lancer when dealing with starting an operating system.
We show here what type of linguistic information can be extracted from the corpus. This information will be
used in the preparation of pedagogical material, and for the automatic generation of exercises.
i) To boot is an ergative verb, i.e. the action can be described from the point of view of the agent or of the
one that is affected by the action. The basic structure of this verb has three arguments and the subject is
the agent of the action16 :
N0 boots N1 Prep N2 with the following arguments :
N0 = : Nhum or Nbootappl (= application software allowing the system to boot, such as LILO)
N1 = : Nbootobj (= all the objects that can be booted : operating system, disk, bootdisk, hard disk,
floppy disk, kernel)
Prep = : With, from, off
N2 = : Nbootingobj (= booting objects, e.g. CD, CD-ROM, D :, C :, A :, file, emergency disk)
EN To boot one of your old kernels off the hard drive…
FR Pour lancer l’un de vos vieux noyaux à partir du disque dur…
EN A good idea might be to boot the notebook with a kernel
FR Une bonne idée serait de démarrer le portable avec un noyau
EN In order to have LILO boot Linux from OS/2 Boot Manager,…
FR Afin que LILO lance Linux à partir du gestionnaire de démarrage d’OS/2, …
The corpus allows us immediately to detect the variety of English prepositions and how they are translated into
French. Analysing the sentences with a three-argument structure enabled us also to build up a list of arguments
for each position.
ii) A simple transitive sub-structure is possible : N0 boots N1
N0 = : Nhum + Nbootappl
N1 = : Nbootobj
EN LILO is a program that will allow you to boot Linux
FR LILO est un programme vous permettant de lancer Linux
iii) The intransitive form in which the argument in the position of subject represents the element that is
affected by the action is the following : N0 boots, with N0 = : Nbootobj
EN When Linux boots, it is usually configured not to produce…
FR Quand Linux se lance, il n’est habituellement pas configuré pour…
iv) A prepositional structure, in which the object in the N1 (first object) position is assumed to have been
deleted, is also quite common : N0 boots Prep N1, with N0 = : Nbootobj, Prep = : to :
EN Your BIOS may not allow you to boot directly to a SCSI drive.
FR Votre BIOS ne vous permettra peut-être pas de démarrer directement à partir d’un disque SCSI
EN Your BIOS may not allow you to boot to a Linux installed there
FR Votre BIOS peut ne pas vous permettre de démarrer un système Linux qui y serait installé
16 N0 is the noun in the subject position, N1 the nouns in the object position, and N2 the nouns in the position of
second object.
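The four structures of to boot listed above lend themselves to a simple machine-readable representation, which is one possible starting point for the automatic generation of exercises. The sketch below is hypothetical : the Frame class, its field names and the gap_fill function are invented for illustration, while the patterns, argument classes and example sentences are taken from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One syntactic structure of a verb, in the N0/N1/N2 notation of the text."""
    pattern: str
    slots: dict = field(default_factory=dict)
    example_en: str = ""
    example_fr: str = ""


BOOT_FRAMES = [
    Frame("N0 boots N1 Prep N2",
          {"N0": ["Nhum", "Nbootappl"], "N1": ["Nbootobj"],
           "Prep": ["with", "from", "off"], "N2": ["Nbootingobj"]},
          "To boot one of your old kernels off the hard drive…",
          "Pour lancer l’un de vos vieux noyaux à partir du disque dur…"),
    Frame("N0 boots N1",
          {"N0": ["Nhum", "Nbootappl"], "N1": ["Nbootobj"]},
          "LILO is a program that will allow you to boot Linux",
          "LILO est un programme vous permettant de lancer Linux"),
    Frame("N0 boots",
          {"N0": ["Nbootobj"]},
          "When Linux boots, it is usually configured not to produce…",
          "Quand Linux se lance, il n’est habituellement pas configuré pour…"),
    Frame("N0 boots Prep N1",
          {"N0": ["Nbootobj"], "Prep": ["to"]},
          "Your BIOS may not allow you to boot directly to a SCSI drive.",
          "Votre BIOS ne vous permettra peut-être pas de démarrer "
          "directement à partir d’un disque SCSI"),
]


def gap_fill(frame, verb="boot"):
    """Blank out the first listed preposition that follows the verb.

    Returns (sentence_with_gap, answer), or None if the frame has no
    preposition slot or none is found after the verb.
    """
    preps = frame.slots.get("Prep", [])
    words = frame.example_en.split()
    start = next((i for i, w in enumerate(words)
                  if w.lower().startswith(verb)), None)
    if start is None:
        return None
    for i in range(start + 1, len(words)):
        if words[i].lower() in preps:
            return " ".join(words[:i] + ["____"] + words[i + 1:]), words[i]
    return None


print(gap_fill(BOOT_FRAMES[0]))
```

Searching only after the verb avoids blanking the infinitive marker in to boot. A real exercise generator would need part-of-speech information to make this choice reliably.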
In this context, lancer can also very rarely be translated by to launch, which is a more general verb. In radically
different contexts, such as lancer une commande, to run, to issue, and to type can be found.
The structures and arguments described above show the difference between the general verb to boot and the
highly subject-specific neologism to boot. Apart from the distinct etymological origin (which is however not
very useful from a synchronic point of view), the neologism to boot presents structures, as well as arguments,
that are very different from the general verb. This is illustrated by the two examples below, which have been
extracted from a concordance on the Herald Tribune :
(21) In early 1988 the Saudis booted out Hume A. Horan
(22) …eating habits under control by booting the French chef and his staff. The next…
The next sub-section deals with the problem of verbs that already exist in general English, and that also have
highly technical uses.
3.2. Specialised uses
Numerous verbs existing in general English can be found in the computer science subject area with specialised
uses that are very different from the general English meaning. Comparing the candidates with their French
equivalents, but also with their general English uses allowed us to isolate the subject-specific uses, as shown in
the examples below :
To save
HOWTO :
These settings will be saved for you
Cette configuration sera sauvegardée
Herald Tribune :
to save court time
he turned to the church to save his skin
the government hopes to save hundreds of millions of dollars
These examples show that the arguments of the verb are very different in CS English ; the French translation of
to save in its specialised use is sauvegarder, whereas in the three general uses given above, the verb will be
translated by gagner, sauver, and épargner, respectively.
As was already shown in the case of neologisms, comparing an English verb with its French equivalents allowed
us to highlight uses that are unknown to French-speakers. The expected meaning of to post in CS English is
« to send a message by e-mail, especially to a newsgroup », which is confirmed by the French translation below :
(23) a. Everybody should have a look through this section before posting for help
b. Tout le monde devrait y jeter un coup d’oeil avant d’envoyer un message demandant de l’aide
The meaning of the following example is completely different :
(24) a. Called by the kernel when the card posts an interrupt
b. Appelé par le noyau quand la carte déclenche une interruption
The distance between general use and specialised use is on a continuum between « almost general » and
« completely specialised ». Command terms that are used with an operating system like UNIX and Linux can be
integrated into sentences as verbs with very specialised meanings. The technical use of to quit for example, is
close to its general meaning, i.e. « to get out of a session ». In the e-mail application running under UNIX or
Linux, quit is a command whose function is to leave the application without saving deleted messages ; the
meaning of verbs and the name of commands merge together when the name of a command is integrated into a
sentence as a verb. In this case, the use of the technical verb is very different from its general use.
To kill, which means « to suddenly stop a process », is not as close to its general use, although the French
translation tuer can be found, as well as détruire. Finally, the relation between general and specialised uses of to zip
(French : compresser) and to unzip (French : décompresser) is very distant.17
These verbs are quite numerous, and some of them are also very frequent, like to run for example. To run has
various uses, and is a frequent verb in CS English (according to our corpus evidence) ; in our general
newspaper corpus (The Times) it is quite frequent as well (cf. 2.2. Frequencies), but with other meanings. However,
17 The neologism to gzip has been created on the basis of to zip
very few indications about its specialised uses can be found in reference books. Computing dictionaries18 do not
mention it. Among the thirty or so uses given by the Merriam-Webster’s, only one is related to computing : to
run a problem through a computer, a use that is quite rare in CS English. This use can be found in the Collins-
Cobuild, but along with another one : You don’t need a degree in mathematics to run (= operate) a computer. A
quick check in the HOWTOs and RFCs corpora gives the following result : there are only four occurrences of
run something through in the HOWTOs, and none in the RFCs. Moreover, the arguments of to run do not match
those found in the dictionaries :
(25) a. Dictionaries : To run a problem through a computer
b. Corpus : If you run your file through TeX program
Scanning bilingual dictionaries gave us the following translations : exécuter, passer, fonctionner, être en
marche, and utiliser. We then analyzed the occurrences of to run in the corpus. This showed us that the above
translations are not the only ones in use, and gave us complete information about the phraseology of the different
uses. We give here a few examples of the two basic uses of to run and the various translations that can be found
in our corpus:
i) to run → lancer, exécuter
(26) a. You forgot to run LILO or system doesn’t boot at all
b. Vous avez oublié de lancer LILO ou le système ne boote plus du tout
(27) a. It just runs a command…
b. Il ne fait qu’exécuter une commande…
(28) a. …32-bit code that runs in 16-bit mode…
b. …du code 32 bits qui s’exécute en mode 16 bits…
ii) to run → faire tourner, tourner, fonctionner
(29) a. You can run Linux on any Alpha-based machine
b. Vous pouvez faire tourner Linux sur n’importe quelle machine Alpha
(30) a. The ability of any Alpha-based machine to run Linux (patient in subject position, active voice)
b. La possibilité de faire tourner Linux sur une machine Alpha (operator faire => introduction
of a third argument in the subject position)
(31) a. If the same program is run on a 21064… (passive voice, patient in subject position)
b. Si le même programme tourne sur un 21064… (active, patient in subject position)
The choice between the prepositions on and under depends on the arguments in the subject and object positions :
application software and operating systems run on a machine or an operating system, while application software
runs under an operating system :
(32) a. VirtuFlex runs on standard UNIX Workstations
b. VirtuFlex tourne sur des stations UNIX standard
(33) a. ANSFORTH system that runs under Win3.2, Win95, WinNT
b. Le système ANSFORTH qui tourne sous Win3.2, Win95, WinNT
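The right-hand context of to run can be extracted automatically to see which arguments follow each preposition. The sketch below is a minimal illustration with invented function names ; the sentences are taken from examples (29) to (33), and the pattern only covers cases where the preposition immediately follows the verb.

```python
import re
from collections import Counter

# A form of "to run" immediately followed by "on" or "under", then the rest.
RUN_PREP = re.compile(r"\b(?:run|runs|running|ran)\s+(on|under)\s+(.+)",
                      re.IGNORECASE)


def preposition_contexts(sentences):
    """Count the prepositions after "run" and collect what follows them."""
    counts = Counter()
    contexts = []
    for sentence in sentences:
        m = RUN_PREP.search(sentence)
        if m:
            prep = m.group(1).lower()
            counts[prep] += 1
            contexts.append((prep, m.group(2)))
    return counts, contexts


sentences = [
    "VirtuFlex runs on standard UNIX Workstations",
    "ANSFORTH system that runs under Win3.2, Win95, WinNT",
    "If the same program is run on a 21064",
    "You can run Linux on any Alpha-based machine",
]
counts, contexts = preposition_contexts(sentences)
print(counts)
for prep, right in contexts:
    print(prep, "->", right)
```

The last sentence is not matched because a direct object stands between the verb and the preposition ; covering the transitive structure would require a wider pattern or part-of-speech tags.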
These examples show how corpus analysis can highlight the great variety of existing structures and arguments,
as well as the relationship between structures and transformations. Extracting the left and right context of verbs
enabled us to obtain a list of possible arguments which had to be checked with an expert in computer science.
How can the layman know that LILO is a boot program, that inetd is the name of a program, or that Pentium,
which is the name of a microprocessor, is a metonymy for « computer » ?
The comparison with uses in general English can help isolate technical verbs. The uses described above cannot
be found in general English ; on the contrary, it is possible to find structures that never appear in CS English :
(34) …become a presidential concern about running for re-election in 1996…
(35) …stamps, old coins, and odd documents, run around the square. Cafés and…
3.3. General English verbs
18 FOLDOC, A Glossary of Computing Terms, Dictionary of Computing For Learners of English
Teaching CS English verbs cannot concentrate solely on highly subject specific or specialised verbs. Some
general verbs are quite often used in CS English.
Comparing a general corpus with a specialised corpus for non-specialised verbs revealed differences in the
frequency of the different uses. While a general verb has several general uses, only one can be found in CS English.
To install is more frequent in CS English than in our general corpus ; in the computing field, it is used in only
one type of context :
(36) You must configure and install an appropriate kernel and then install the AX.25
In the computing context, it is only programs which can be installed. In the Herald Tribune, in contrast,
occurrences of install have been found in structures in which a human argument is in the position of direct
object :
(37) the country’s new president, who was installed in January. He was…
Technical uses of install occur much less frequently in general English :
(38) by having a catalytic converter installed in her old-fashioned Volkswagen Derby
Noun uses can be different in technical and general English : in general English, the noun is installation, while in
CS English, the most commonly used noun is install. Verb/noun ambiguity can thus be more difficult to resolve in CS
English.
Another problem related to verb/noun ambiguity lies in the structural differences between a verb and a noun, in
French and English. Access is an example of this difficulty. In English, the noun is followed by the preposition
to ; the French noun accès is followed by the preposition à. Access, however, is also a transitive verb in English,
whereas the French verb is followed by a preposition : accéder à.
(39) a. Postgress95 which provides simple access to any existing database
b. Postgress95 qui fournit un accès à n’importe quelle base de données existante
(40) a. The user can access the system
b. L’utilisateur peut accéder au système
 *The user can access to the system
Adding a preposition after to access is a very common mistake among French-speakers19.
This shows how useful it can be to look for general English verbs in specialised corpora.
4. Conclusion
Developing a linguistic description is not an easy task in a highly technical subject area. The linguist cannot rely
on intuition because s/he does not have the necessary technical knowledge ; information found in reference
books is of little help.
Using and relying on authentic documents is therefore absolutely necessary ; contrastive work on bilingual
corpora allowed us to list the characteristics of technical verbs. It has also enabled us to identify differences in the
use of specific structures between French and English. The observation of the English equivalents of the French
verbs threw new light on the relationships between the different uses of a verb in English.
The current linguistic description needs to be refined : the description of the structures is not yet coupled with
systematic statistical information.
Concerning the teaching of CS English, compiling a learner corpus should help us complete our teaching
objectives. A corpus-driven description of learners’ English would lead to a description of linguistic reality. As
stated in Granger and Tribble (1998), compiling and analysing a corpus of non-native learners’ language allows the
linguist to highlight the learners’ difficulties, and therefore to decide what must be taught.
Working on corpora permitted us to achieve two aims : on the one hand, to show students the different verb
structures from contrastive concordance samples and then to allow them to look for equivalences in the parallel
corpora ; on the other hand, to describe the verbs with a focus on differences and on
potential problems. The description was then used to generate exercises automatically.
19 We frequently noticed it among undergraduate students.
Gap-filling exercises can be produced from concordances. It is possible, for example, to ask students to find the correct
preposition after to run :
Figure 3 : Gap-filling exercise : the student must fill in the gaps with the correct preposition.
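The idea of producing gap-filling exercises from concordance lines can be illustrated with a minimal Python sketch. It is not the system described in the paper: the function name make_gap_items, the preposition list and the sample lines are assumptions made for illustration only (the first two lines reuse examples (34) and (35), the third is invented).

```python
import re

# Hypothetical preposition list; a real system would derive it from the
# linguistic description of the verb.
PREPOSITIONS = ("on", "under", "in", "for", "around", "into", "out", "through")


def make_gap_items(lines, verb_forms, prepositions=PREPOSITIONS):
    """Blank out the preposition that directly follows a form of the verb.

    Returns a list of (gapped_line, expected_preposition) pairs.
    Lines without a verb + preposition sequence are skipped.
    """
    verb_alt = "|".join(re.escape(v) for v in verb_forms)
    prep_alt = "|".join(re.escape(p) for p in prepositions)
    pattern = re.compile(rf"\b({verb_alt})\s+({prep_alt})\b", re.IGNORECASE)
    items = []
    for line in lines:
        match = pattern.search(line)
        if match is None:
            continue
        # Replace only the preposition (group 2) with a gap.
        gapped = line[:match.start(2)] + "____" + line[match.end(2):]
        items.append((gapped, match.group(2).lower()))
    return items


SAMPLE_LINES = [
    "become a presidential concern about running for re-election in 1996",
    "stamps, old coins, and odd documents, run around the square.",
    "You must configure and install an appropriate kernel",  # no form of run
]

if __name__ == "__main__":
    for gapped, answer in make_gap_items(SAMPLE_LINES, ["run", "runs", "running", "ran"]):
        print(gapped, "| answer:", answer)
```

Keeping the expected preposition next to each gapped line is what makes automatic correction of this restricted exercise type straightforward.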
The more precise the linguistic description is, the more sophisticated the exercises can be. Moreover, combining
a linguistic description with a corpus-based description of French speakers’ errors (see Cornu et al. 1993) will lead to the
automatic correction of exercises less restricted than gap-filling ones, which need precise grammar checking.
Bosworth-Gerome S., Ingrand C., Marret R., 1992 : Comprendre l’anglais scientifique et technique. Ellipse :
Brookes M., Lagoutte F., 1993 : English for the Computer World. Belin : Paris.
Cornu E., Kübler N., Bodmer F., Grosjean F., Grosjean L., Lewy N., Tschichold C., Tschumi C., 1997 :
« Prototype of a Second-Language Writing Tool for French-Speakers Writing in English ». Natural Language
Engineering, 2(3).
Granger S., Tribble C., 1998 : « Learner Corpus Data in the Foreign Language Classroom : form-focused
instruction and data-driven learning » ; in : Granger S. (ed.) Learner English and the Computer. Longman :
Foucou P.-Y., Kübler N., 1999 : « A Web-based Environment for Teaching English : General Architecture ».
ReCall, special issue.
Foucou P.-Y., Kübler N., 2000 : « A Web-based Environment for Teaching Specialised English » ; in : Lou
Burnard and Tony McEnery (eds.) Rethinking Language Pedagogy: papers from the third international
conference on language and teaching. Peter Lang GmbH : Frankfurt am Main.
Hoffman L. 1985 : Kommunikationsmittel Fachsprache. Günter Narr Verlag : Tübingen .
Johns T., 1988 : « Whence and Whither Classroom Concordancing » ; in : Bongaerts, T. et al. (eds) Computer
Applications in Language Learning, 9-27. Foris : Dordrecht.
Kübler N., 1995 : L’automatisation de la correction d’erreurs syntaxiques : application aux verbes de transfert
en anglais pour francophones. PhD thesis, Université de Paris 7, publications de l’Institut Gaspard Monge :
Université de Marne La Vallée.
Pearson J. 1998 : Terms in Context. John Benjamins Publishing Company : Amsterdam .
Trimble R. M. T., and Trimble L. 1978 : « The Development of EFL Materials for Occupational English : The
Technical Manual ». In : R. M. T. Trimble, L. Trimble and K. Drobnic (eds) : English for Specific Purposes.
Science and Technology. English Language Institute : Oregon State University.
Vance S., 1995 : « Concordances with Language Learners : Why ? When ? What ? », CAELL Journal, vol.6,
Wichmann A., S. Fligelstone, A. McEnery and G. Knowles (eds) 1997. Teaching and Language Corpora.
Longman : London.

Tim Johns' page

kukuwka's insight:

Many useful links for those who want to teach with the help of corpora


Discovering Language through Corpora

Discovering Language through Corpora: the Skills Learners Need and the Difficulties they Encounter
Francesca Bianchi, Elena Manca
Università del Salento
Most scholars agree on considering corpora as a valuable source of linguistic
information for native and non-native speakers alike. Few researchers, however, have dealt with and systematically analysed the objective difficulties encountered by students while trying to exploit corpus data. The current paper
describes a quantitative study of corpus consultation by learners and aims to
establish whether different corpus analysis tasks can be considered to have different degrees of intrinsic difficulty. To this end, 26 corpus project work assignments produced by two different groups of students were assessed and tagged
according to specific parameters that reflect the skills needed in corpus analysis.
The data were analysed applying both parametric (ANOVA) and non parametric
tests (Mann-Whitney U-test), which showed that, despite clear individual and
teaching/learning environment differences between the two groups of students,
the students’ results in most of the tasks were due to different levels of intrinsic
difficulty. This led to the creation of a General Difficulty List of Corpus Analysis Tasks.
Keywords: corpus analysis, skills, student difficulties, analysis of project works,
teaching planning.
1. Introduction and research question
Most scholars agree on considering corpora as a valuable source of linguistic
information for native and non-native speakers alike. For this reason, many
linguists have been increasingly advocating the use of corpora in language
learning/teaching (Aston 2001; Cobb 1997; Flowerdew 1993; Levy 1990, 1997;
Owen 1996; Sinclair 2003, 2004; Steven 1991; Tribble & Johns 1997). The possible uses of corpora in language learning and in translation have been widely
discussed (Frankenberg-Garcia 2005b; Gavioli 2005; Gavioli & Zanettin 1997;
Granger & Tribble 1998; Sharoff 2004; Tognini Bonelli 2001; Tribble & Jones
1990, 1997; Zanettin 2002; Zanettin et al. 2003), although some authors have
illustrated the need for corpora specifically created for pedagogic purposes
(Braun 2005). Furthermore, some researchers have suggested direct student
access to corpora (Gavioli & Aston 2001), and others have described the serendipitous discoveries that students have made while directly accessing corpora
(Bernardini 2000a, 2004; Bernardini & Zanettin 1997). Few researchers, however, have dealt with and systematically analysed the difficulties encountered
by students while trying to exploit corpus data. A brief review of the major papers on this issue is provided below.
One of the first authors to deal with the processes and results of students’
corpus exploration is Bernardini (2000b). Her paper focuses on students using the British National Corpus. Her observation of how the students approach
corpus investigation reveals some problematic tendencies, including the fact
that the students often ignore variants, do not look for alternative and more
successful approaches and tend to make only summary analyses.
Kennedy and Miceli (2001) provide a fairly detailed qualitative analysis of
the way students proceeded in using a corpus as a reference for writing in a foreign language (Italian). They consider four steps in corpus investigation: formulating the question; devising a search strategy; observing the examples and
selecting relevant ones; drawing conclusions. Their recordings and interviews
show that students have problems with all the steps considered, which led the
authors to devise some tips for each step, so as to guide the students towards
more precise and fruitful research practices.
Sun (2003) analyses the learning process and the strategies used by three
undergraduate English FL learners when accessing corpus data to proofread
texts with grammar mistakes. She also examines the factors that impact on the
students’ behaviour. The students received a relatively quick introduction to
concordance analysis, and their problem-solving strategies were collected using a think-aloud protocol. This author classifies four cognitive skills required
in the analysis of concordance lines, namely: comparing; grouping; differentiating; and making inferences. From Sun’s description, it seems that the three
students went constantly through all the phases mentioned, even though the
teacher’s help was at times needed for the students’ correct progress. The author concludes by stating that four factors influenced the learners’ investigations and the strategies they used: prior knowledge; cognitive skills; teacher
intervention; and skills in using the concordance software.
Another paper mentioning and analysing student difficulties in corpus use
is by Yoon and Hirvela (2004). The major focus in their study, however, is on
student responses to corpus use so that the analysis of problems/difficulties is
carried out with the goal of providing evidence for student likes and dislikes.
The types of difficulties they take into consideration revolve around what the
students feel as problems in accessing the corpus and include matters such as:
data analysis is time consuming; concordance output provided too many or too
few sentences; texts or chunks were difficult to read or included unknown vocabulary; Internet connection was too slow or not available. Only one item in
their list generally refers to difficulty in ‘concordance output analysis’.
Chambers (2005) examines the strategies generally used by her students
in accessing corpora, and their efficacy or otherwise. This was part of a study
designed to ‘examine a number of aspects of course design in corpora and language learning involving direct access by learners’ (Chambers 2005: 112) and to
‘draw some conclusions concerning the factors that favour the integration of
corpora and concordancing into the language-learning environment and the
obstacles which remain to be surmounted’ (Chambers 2005: 112). Her discussion is based on qualitative analysis of 11 end-of-course essays. Her data highlighted
a considerable amount of variation in the students’ ability to explore the corpus (Chambers 2005: 119), which led her to conclude that “differences in motivation or learning
styles may explain the considerable variation in the success of the activity. In addition to
the variation in analytical ability, there was also considerable variation in the students’
ability to reflect on the nature and limitations of the corpus, an ability which came easily
to some students, but was totally lacking in others” (Chambers 2005: 119).
Finally, Frankenberg-Garcia has dedicated more than one study to this issue.
Her 2005 paper focuses on translation students and how they combine the use
of corpora, termbanks, the Web and printed references. Her plenary speech
in 2006 at the 7th TALC Conference (Paris) provided a detailed description of novice users’ problems in accessing corpus data and presented task-based, non-corpus-specific
conscious-raising exercises aimed at helping [novice users] gauge different corpora
and discern which ones are best suited to their purposes, develop basic corpus-searching strategies, and get used to interpreting corpus data (Frankenberg-Garcia 2006: 5).
Her list was inspired by a general review of the literature as well as by personal observation of the way students used the COMPARA corpus. Her comments
and exercises focused on issues such as problems in choosing a suitable type of
corpus or sub-corpus, formulating corpus queries and follow-up queries, and
interpreting corpus data.
The above-mentioned studies are substantially different in terms of focus
of interest, the way they were conducted, the types of students involved, the
teaching objectives of each course/module, and the way corpora were introduced to the students. Furthermore, their results are frequently rather contextualized.
However – quoting from Frankenberg-Garcia (2006: 5) – ‘they all converge to suggest that corpus skills which come as second nature to experts are
not obvious at all to the untrained’. This was previously pointed out by Sinclair
(2004: 2) when he stated that corpora are not a simple object and that lack of
training and experience in retrieving data may lead students to consider nonsensical conclusions as insightful ones. Thus, the teachers who decide to adopt a corpus approach to language teaching/learning should be aware of the difficulties
that this applied discipline involves and pace the training according to the skills
one might expect from students. Meaningful corpus analysis requires not only
good knowledge of the basic theoretical concepts of the subject, but also practical
experience, as well as skill in using concordancers and in observing, identifying,
classifying, and generalizing data.
The current paper attempts a systematic analysis of the difficulties encountered by students in approaching language through concordancing. Attention
is given to the phases that follow concordance line retrieval and which include
tasks such as selecting concordance lines, categorizing collocates, analysing collocation and colligation, and using the data retrieved to make generalizations
about language or to find a suitable translation equivalent.
As a general hypothesis we may presume that the performance of a task depends on: 1. the difficulty of the task itself (intrinsic difficulty); 2. individual factors,
i.e. individual abilities and background knowledge; and 3. environmental factors,
such as course and exam focus. So far, corpus linguists do not seem to have analysed intrinsic difficulties in corpus analysis tasks. Starting from empirical observations, we developed the following working hypothesis: if two different groups of
students show similar difficulty in performing specific tasks, the influence of individual and teaching/learning environment factors can be considered less relevant
than task-intrinsic difficulty. The following sections describe how this hypothesis
was tested using two randomly selected groups of students.
2. Design of the study
Two separate groups of foreign language students specializing in translation
studies participated in this study: 40 bachelor students from the University of
Lecce, and 10 MA students from the University of Genoa. Both groups were introduced to corpus consultation and analysis and were asked to complete an end-of-course corpus research assignment. The assignment papers were analysed using
a specially developed taxonomy of twelve corpus analysis tasks. Analyses were
carried out at individual, group, and general levels.
2.1 Participants
Two groups of Italian students participated in this study: 40 undergraduate students enrolled at the Faculty of Foreign Languages of the University of Lecce, and
10 MA students enrolled at the Faculty of Foreign Languages of the University
of Genoa. The two groups attended separate courses on how to use corpora for
analysing language and finding translation equivalents: the courses were held by
the authors of this paper (hereafter, researchers or we), one in Lecce and one in
Genoa. None of the students had ever heard of corpora or corpus analysis before,
except Student 102, who had very basic knowledge in the field. The students differed in terms of foreign language background, general academic background,
and familiarity with assignment writing. Moreover, they were exposed to different teaching methods. However, both groups of students were introduced to
corpus analysis tools and methods and were asked to submit a similar corpus
research assignment at the end of the course, which represents the rationale for
the comparison and contrast of their results. A schematic summary of the similarities and differences between the two groups is provided in Table 1.

Feature | Lecce | Genoa
Native tongue | Italian | Italian
Course level | Bachelor | MA
Year | 2 | [...] and 2
Number of students | 40 | 10 (5+5)
Number of hours of lessons (including [...]) | 60 hours | 20 hours
Language in which the course was taught | English | Italian
Language in which the project work was carried out | [...] | Foreign language of student’s choice
Languages of the comparable corpus | English - Italian | Language of student’s choice - Italian
Level of proficiency in the FL of the project work | B1/B2 | B2/C1
Assignment | Pair work | Individual work
Table 1: The two groups participating in the study
As Table 1 shows, the students were all Italian native speakers. Lecce students were
all specializing in English, their course was taught in English and the assignment
papers all analysed an English-Italian comparable corpus. On the other hand, the
Genoa group included students specializing in a range of different European languages; for this reason the course was taught in Italian and the students analysed
comparable corpora in Italian and a foreign language (FL) of their choice. The two
groups also differed in terms of proficiency level in the foreign language: B1/B2
in the European Framework of Reference for Lecce students, and the higher B2/C1
for the students in Genoa.
The following section provides a brief description of the contents and teaching methods of the two courses. The description attempts to highlight similarities and differences between the two courses with respect to the tasks considered
in the current study. Contents unrelated to the tasks considered have been omitted for the sake of clarity and focus.
2.2 Course contents
Both courses illustrated the following basic corpus linguistic concepts: corpora;
word lists; running and sorting concordances; collocation; colligation; phraseology; and semantic prosody. However, each of the researchers adopted an individual approach, partly due to the different number of hours and students characterising each course.
In Lecce, the course included two parallel modules: a 40-hour theoretical
classroom module, and a 20-hour practical lab module. Following the British tradition of Firth (1957), Halliday (1985), Sinclair (1996), Stubbs (1996), and Tognini-Bonelli (2001), the theoretical module introduced the students to the basic
corpus linguistics concepts mentioned above, plus the other relevant concepts
of context, meaning in context, and semantic preference. Furthermore, it explained how to find translation equivalents using comparable corpora (Tognini
Bonelli & Manca 2002). The practical module, which took place in a computer lab,
taught the students how to assemble their own corpora, use Wordsmith Tools (a
corpus concordancer), and retrieve and analyze data. When the students seemed
to be ready to work on their own, they were put in pairs, so that they could help
each other out, and tutored in performing a given series of tasks required for
autonomous use of corpora for linguistic analysis and translation. In this phase
of the course, the students were asked to run the wordlists of the two comparable
corpora they had created, search for the most frequent words in each wordlist,
compare the two wordlists, and look for mismatches in frequency between items
in the two wordlists. They were then encouraged to choose one or two English
content words, run their concordances, sort the concordance lines, and find immediate collocates and colligates. As a further step, they were asked to enlarge
the linguistic co-text in order to find collocates in N-2/3/4 and N+2/3/4. Once
they had identified the most frequent collocates occurring with the node word,
the students were asked to group the collocates into semantic fields, and identify
the recurrent phraseology of the node word and its patterns of use. As a last step,
they were invited to find Italian translation equivalents for each of the senses
identified for the node word.
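The first steps of this procedure (running a wordlist, then counting the collocates of a node word position by position, from N-4 to N+4) can be sketched in a few lines of Python. This is only an illustration of the technique, not a substitute for Wordsmith Tools: the function names, the simple tokenizer and the sample text are assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (an assumption): lower-case runs of word characters."""
    return re.findall(r"\w+", text.lower())


def word_list(text):
    """Frequency list of all word forms in the text."""
    return Counter(tokenize(text))


def collocates(tokens, node, span=4):
    """Count the collocates of `node` at each position from N-span to N+span.

    Returns a dict mapping an offset (e.g. -1 for N-1, 2 for N+2)
    to a Counter of the words found at that position.
    """
    by_position = {}
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue  # skip the node itself and positions outside the text
            by_position.setdefault(offset, Counter())[tokens[j]] += 1
    return by_position


# Invented sample text, used only to show the calls.
SAMPLE = ("The hotel offers a sea view. The room has a sea view "
          "and a garden view")

if __name__ == "__main__":
    print(word_list(SAMPLE).most_common(3))
    positions = collocates(tokenize(SAMPLE), "view")
    print("N-1:", positions[-1].most_common())
    print("N-2:", positions[-2].most_common())
```

Grouping the collocates into semantic fields and identifying phraseology remain interpretative steps for the student; the sketch only covers the retrieval that precedes them.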
At the end of the course, the students were asked to hand in a paper with the
following assignment (pair work): Choose 1 or 2 words among the most frequent in
your English corpus. For each word identify collocation, colligation, semantic preference,
and semantic prosody. Identify the phraseology around the node word. Identify possible
translation equivalents of the node word in your Italian comparable corpus using the
methodology seen in class.
In Genoa, all lessons were carried out in a computer lab provided with two
concordancing programs: Wordsmith Tools and ConcApp. Each lesson included
both theory and practice, for a total amount of 20 hours. The course focused on
the same basic concepts as the Lecce course, except semantic preference. The
students were also shown some ‘automatic’ retrieval features in Wordsmith Tools
not presented to the Lecce students: keyword lists; the Cluster feature (which retrieves n-grams); and the Collocate feature. The topics and the order in which concepts were presented were loosely inspired by Sinclair (1991), Partington (1996),
and Bowker and Pearson (2002). Theoretical concepts were explained to the students
using a seminar-like approach and every topic was immediately followed
by hands-on exercises. Examples and concordances to work on were given in Italian. When necessary, comparable corpora in other languages (either freely available on the Internet or provided during the course) were used to look for translation equivalents. When the students were considered sufficiently acquainted
with basic corpus analysis, concepts and techniques, attention shifted to translation problems and solutions. After a brief review of the issues of polysemy/homonymy, suggestions were given about how to use comparable corpora to find
translation equivalents, based on Tognini Bonelli (2001) and a simplified version
of Sharoff (2004).
Finally, the students spent some hours on guided review exercises aimed
at raising autonomy in the use of corpus tools. This work was carried out individually.
At the end of the course, the students were asked to carry out individual
project work and hand in a paper. The following instructions were given: Choose
3 or 4 words in one of the languages you study. For example, choose words that gave you
problems in your last translation, synonyms provided by a dictionary, near synonyms
whose subtle semantic differences seem difficult to distinguish, or simply 3 terms that
tend to show up in the same semantic context. Analyse each word, along with their translation equivalents. Compare the information provided by your corpus/corpora with that
provided by dictionaries.
2.3. Materials and Methods
The Genoa course produced 10 assignment papers, each one written by a different student. The Lecce course, on the other hand, produced 21 assignment papers,
as most of the students worked in pairs. However, in the current study only 16
assignments from the Lecce group were considered, since at the time when the
study was being carried out 5 papers were no longer available.
The students’ assignments were manually marked up using a tagging scheme
that was specifically and jointly developed by the two researchers. The tagging
scheme focused on 12 tasks that the researchers considered of primary importance in the given assignments; theoretical and practical explanations about how
to perform each of these tasks were given during the two courses. Table 2 provides a list of the tasks considered, along with the corresponding tags.
Collocation 2.7
Corpus 2.9
2.8 Meaning 2.5
2.8 Question dictionaries 2.5
Collocation 2.6 Semantic prosody 2.3
Phraseology 2.5 Semantic field 2.2
Generalization 2.4 Phraseology 2.2
Colligation 2.2 Generalization 2.2
Semantic prosody 2.0 Line selection 1.9
Semantic field 2.0
Table 4: Group results
Individual results were tabulated and statistical analyses were carried out on
both group and collective values, using SPSS. Analyses, which are presented and
discussed in Section 3, included distribution, calculation of mean and median
values, ANOVA and Mann-Whitney U test.
3. Results and discussion
The students’ results are reported in the Appendix (Table A). Genoa students are
numbered 101-110, while Lecce students are numbered 201-216. For each student,
and for each task the following data are shown: mean value (Mean); number of
observations (N); standard deviation (Std. Dev.). Furthermore, the last column
shows each student’s overall mean result, considering all the tasks in the assignment. Finally, the last three rows in the table refer to the whole group of students
participating in the study. Table A in the Appendix shows great individual variation, both ‘among students’ and ‘within students’. Overall individual results
range from as low as 1.90 (Student 206) to as high as 3.40 (Student 101). Within-student variability is usually very high, with just one student (Student 107) showing a consistent mean in all tasks.
Group results in the different tasks are summarised in Table 4; tasks are listed
in decreasing order.
l. sel.
Mann-Whitney U 1129 410.5 79 44 24 80.5 419 915 918 47 438
Mann-Whitney P 0.64 0.000 0.06 0.90 0.65 0.000 0.45 0.07 0.84 0.29 0.44
ANOVA F 0.16 29.53 3.95 0.06 0.44 40.78 0.42 3.77 0.00 1.75 0.58
ANOVA P 0.68 0.00 0.05 0.79 0.51 0.00 0.51 0.05 0.92 0.19 0.44
Table 5: Results to the second decimal place of Mann-Whitney U and ANOVA tests
Before attempting to comment on the differences between the results of the two
groups, we decided to carry out a statistical comparison, to see whether differences between the two groups could be considered significant. To this end, after
assessing the distribution of both group and whole-group results, we decided to
apply both a parametric test (ANOVA) and a non parametric one (Mann-Whitney
U test), for greater certainty.
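For readers unfamiliar with the two tests, the statistics themselves can be computed as in the following Python sketch. The study used SPSS; the functions below are a plain re-implementation of the standard formulas for the Mann-Whitney U statistic (with average ranks for ties) and the one-way ANOVA F ratio, written for illustration. The function names are assumptions, and the P values would still have to be obtained from the relevant distributions or from a statistics package.

```python
from statistics import mean


def average_ranks(values):
    """Rank values starting from 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U statistic (the smaller of the two U values)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


def anova_f(*groups):
    """One-way ANOVA F ratio: between-group over within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

Applied to one task at a time, with the Genoa scores as one group and the Lecce scores as the other, these two functions yield the kind of U and F values reported in Table 5.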
As Table 5 shows, the two tests gave the same type of results: in the vast majority of cases (eight tasks out of eleven: Collocation, Meaning Disambiguation, Semantic Prosody, Semantic Field, Phraseology, Translation equivalent, Question
Dictionaries and Corpus) the difference between the two groups was not significant; this is tantamount to saying that, as far as these tasks are concerned, the
two groups can be thought of as belonging to one and the same population. Thus,
the results of the two groups in these tasks could probably be considered primarily due to intrinsic difficulty rather than individual and environmental factors
(see Section 2.1 for a description of individual and environmental differences
between the two groups). The remaining three tasks (Line Selection, Meaning,
and Colligation), on the other hand, showed significant differences between the
two groups (P<0.05), which suggests that the difference in results is unlikely to be due
to chance. This, however, does not rule out the existence of intrinsic difficulty in
these three tasks, but simply suggests that intrinsic difficulty did not emerge at
group level in the current experiment.
Let us now consider whole-group mean results. Following the hypothesis that
different tasks are characterized by different intrinsic difficulty levels, we can assume correspondence between lower mean results and greater difficulty of the
task. The ranking in Table 6 was obtained by listing whole-group mean results in
decreasing order.
Question dictionaries 3.0 Less difficult
Colligation 2.9
Line selection 2.9
Corpus 2.9
Meaning disambiguation 2.8
Translation equivalent 2.8
Meaning 2.7
Collocation 2.6
Semantic prosody 2.3
Phraseology 2.3
Semantic field 2.2
Generalization 2.2 More difficult
Table 6. General Difficulty List for Corpus Analysis Tasks
This list can tentatively be considered a General Difficulty List for Corpus Analysis Tasks. However, it should only be taken as a preliminary hypothesis of ranking of the tasks considered, as we believe ranking should be verified in further
studies on a wider population and a higher number of observations.
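The ranking procedure itself is simple: compute the whole-group mean for each task and list the tasks in decreasing order of mean, so that lower means are read as greater difficulty. A minimal Python sketch follows; the function name and the scores in the example are hypothetical and are not the study's data.

```python
from statistics import mean


def difficulty_ranking(scores_by_task):
    """Rank tasks by whole-group mean score, highest (least difficult) first.

    `scores_by_task` maps a task name to the list of scores obtained by
    all students, pooled across groups.
    """
    means = {task: mean(scores) for task, scores in scores_by_task.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pooled scores, for illustration only.
EXAMPLE_SCORES = {
    "Generalization": [2, 2, 3],
    "Colligation": [3, 3, 3],
    "Collocation": [2, 3, 3],
}

if __name__ == "__main__":
    for task, value in difficulty_ranking(EXAMPLE_SCORES):
        print(f"{task}: {value:.1f}")
```

Because the ranking depends only on the means, tasks whose means are very close (as several are in Table 6) can swap places with small changes in the data, which is one reason the list should be treated as preliminary.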
4. Conclusions
The current study sprang from the general observation that the results of a student in performing a corpus investigation task depend partly on the difficulty
of the task itself (intrinsic difficulty) and partly on external factors, such as the
student’s cognitive skills, and environmental factors, including course and exam
focus. The working hypothesis we formulated was that if two different groups
of students showed similar difficulty in performing some analytical tasks using
corpora, then, when it comes to those particular tasks, intrinsic difficulty could
be considered more relevant than the influence of external and environmental
factors.
Statistical analyses, which included calculation of mean results, normality
test, ANOVA, and Mann-Whitney, were performed on data from 26 participants
belonging to two different groups. Analyses showed that, despite the known differences between the two groups of students (environmental factors) and the
existence of individual differences among the participants, Genoa and Lecce students could be considered as a single population with normal distribution, in
almost all tasks. The statistical analyses also suggested that, in most of the tasks,
the students’ higher or lower results were probably not to be considered dependent on environmental factors, but rather on the different intrinsic difficulty of
each task.
Consequently a General Difficulty List for Corpus Analysis Tasks was created
using whole-group mean results. This list takes into account the difficulties encountered by the students of both groups, who were exposed roughly to the same
course content, but differed in terms of level of studies, previously acquired analytical and research skills, course attended, teaching methods they were exposed
to, and assignment given. Although the General Difficulty List that emerged in
this study needs further verification on a wider population and a higher number
of observations, we believe that such a list could be of great significance when
designing courses that include the use of corpus analysis tools.
As a final remark, this study leads us to suggest an analytical, rather than a
holistic approach to project work assessment. In fact, while tagging our students’
assignments, we noticed that our previous holistic assessments had, at times,
been influenced by factors such as each student’s fluency in expressing concepts,
the general level of presentation of project work, and the order in which assignments were assessed. Finally, an analytical approach when assessing students’
work may help avoid possible bias towards individual students based on their
previous results.