What is Corpus Linguistics?
Corpus linguistics is the analysis of language (linguistics) using digitized text or texts (a corpus), usually naturally occurring material. Common techniques include generating word frequency lists, concordance lines (keyword in context, or KWIC), and collocate, cluster and keyness lists. The plural of corpus is corpora.
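To give a feel for what a concordance (KWIC) line looks like, here is a minimal Python sketch. It is only an illustration, not how any particular concordancer works: the function name kwic, the width parameter and the bracket layout are all assumptions made for this example.

```python
import re

def kwic(text, keyword, width=20):
    """Return one line per hit, with `width` characters of context on each side.

    Hypothetical helper for illustration; real concordancers offer far more
    (sorting, word-based context, regular-expression queries, and so on).
    """
    lines = []
    # \b keeps "be" from matching inside longer words such as "because".
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sample = "To be or not to be; that is the question."
for line in kwic(sample, "be", width=10):
    print(line)
```

Each hit is printed on its own line with the keyword in the same column, which is what makes patterns to its left and right easy to scan.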
What does one need to do corpus linguistics?
A personal computer (Windows, Mac, Linux, etc.) is usually enough for small corpora. With it, one can use a concordance program, or concordancer, to analyse plain-text files (extension “.txt”).
What does one need to know to do corpus linguistics?
Knowing the language you want to study is, of course, important. You also need to know some of the basic ideas in corpus linguistics, such as word list, frequency, type, token and concordance. Since these are the most basic and important concepts, let us have a quick look at them.
The first thing you would want to do is make a word list. It is usually arranged from the highest to the lowest frequency of types. A type is a unique form of a word, while each individual occurrence of a word in the text is a token. A “word” is defined as a run of letters separated by spaces or punctuation. Thus the sentence:
“To be or not to be; that is the question.”
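The sentence above can be turned into a word list with a few lines of Python. This is a rough sketch under stated assumptions: a word is taken to be a run of letters, and everything is lowercased so that “To” and “to” count as the same type. A concordancer may make different choices (for example, keeping case), which would change the counts. The function names tokenize and word_list are made up for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a word is a run of letters; spaces and punctuation separate words.
    # Lowercasing is also an assumption, so that "To" and "to" are one type.
    return re.findall(r"[^\W\d_]+", text.lower())

def word_list(text):
    # Pairs of (type, frequency), from highest to lowest frequency.
    return Counter(tokenize(text)).most_common()

sentence = "To be or not to be; that is the question."
tokens = tokenize(sentence)   # every running word
types = set(tokens)           # unique word forms

print("tokens:", len(tokens))  # tokens: 10
print("types:", len(types))    # types: 8
for word, freq in word_list(sentence):
    print(freq, word)
```

Under these assumptions the sentence has 10 tokens but only 8 types, because “to” and “be” each occur twice and therefore head the word list.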
Via Pascual Pérez-Paredes, oAnth - "offene Ablage: nothing to hide"