Principles of corpus linguistics and their application to translation studies research
Gabriela Saldanha
Centre for English Language Studies, University of Birmingham
1. Introduction
Corpora have been put to many different uses in fields as varied as natural language processing, critical discourse analysis and applied linguistics, to mention just a few. As is to be expected, within each of those areas corpora fulfil different roles, from providing data to build statistical machine translation systems to revealing ideological stance in politically sensitive texts. ‘Corpus linguistics’ is understood here in a more restricted sense, linked to British traditions of text analysis that see linguistics as a social science and language as a means of social interaction where meaning is inextricably linked to the cultural and historical context in which it is produced. This article focuses specifically on the principles of corpus linguistics as a research methodology, and looks at the implications of this specific approach to the study of language in translation studies.
2. A corpus defined in corpus linguistics terms
Because there is no unanimous agreement on the necessary and sufficient conditions for a collection of texts to be a corpus, the term ‘corpus’ is used in the literature to refer sometimes to a couple of short stories stored in electronic form and sometimes to the whole World Wide Web. In order to discuss the fundamental principles of corpus linguistics, it is important first to establish certain limits around what can and cannot be considered a ‘corpus-based’ study of translation.
Different definitions of corpus emphasise different aspects of this resource. The definition offered by McEnery and Wilson (1996: 87), for example, emphasises representativeness: “a body of text which is carefully sampled to be maximally representative of a language or language variety”. The problem with making representativeness the defining characteristic of a corpus is that it is very difficult to evaluate and will always depend on what the corpus is used for. A way around this problem is found in the definition offered by Bowker and Pearson (2002: 9): “a large collection of authentic texts that have been gathered in electronic form according to a specific set of criteria”. Bowker and Pearson’s definition is more flexible than McEnery and Wilson’s, even if the assumption is still that the corpus is intended to be “used as a representative sample of a particular language or subset of that language” (Bowker and Pearson, 2002: 9). By making selection criteria, and not representativeness, the defining characteristic, Bowker and Pearson reflect more accurately the fact that corpus representativeness is always dependent on the purpose for which the corpus is used and on the specific linguistic features under study.
For example, a corpus that accurately represents the distribution of a common feature – say, pronouns – in a certain language subset may not accurately represent a rarer feature, such as the use of reported speech, in the same subset. Generally, corpora are intended to be long-term resources and to be used for a variety of studies, so representativeness cannot be ensured at the design stage.
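The point about common and rare features can be made concrete with a rough numerical sketch. The Python snippet below is an illustration only and is not part of the original argument: the corpus size and the two feature rates are hypothetical figures chosen for the example, and the function name `relative_standard_error` is introduced here. It treats each token as an independent draw (a simple binomial model), which real texts do not satisfy, so the figures should be read as indicative at best. Under that simplifying assumption, the relative uncertainty of an observed feature rate grows as the feature becomes rarer, even though the corpus itself is unchanged.

```python
import math


def relative_standard_error(rate: float, corpus_size: int) -> float:
    """Relative standard error of an observed feature rate.

    Assumes a simple binomial model in which every token is an
    independent draw; this is a simplification, not a claim about
    how real corpora behave.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in the interval (0, 1]")
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return math.sqrt((1 - rate) / (rate * corpus_size))


# Hypothetical figures, chosen only for illustration.
corpus_size = 100_000          # tokens in the corpus
pronoun_rate = 0.05            # a common feature: 5 per 100 tokens
reported_speech_rate = 0.0001  # a rare feature: 1 per 10,000 tokens

common = relative_standard_error(pronoun_rate, corpus_size)
rare = relative_standard_error(reported_speech_rate, corpus_size)

print(f"common feature: {common:.1%} relative standard error")
print(f"rare feature:   {rare:.1%} relative standard error")
```

With these assumed figures, the same corpus yields a far less stable estimate for the rare feature than for the common one, which is one way of reading the claim that representativeness depends on the linguistic feature under study.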

