CORPORA AND CORPUS TOOLS FOR TRANSLATORS
Much has been written on corpora by linguists, but not much by translators. Unfortunately, corpus tools are not designed for translators, or at least not for technical (non-literary) translators. These pages are an attempt to explain how corpora and corpus tools can be used by translators to make their work easier and improve quality. Please send any comments to:
The use of corpora is one of the best ways for translators to improve their translations on their own. A corpus is a collection of texts (not necessarily complete texts, but not random words either) in a machine-readable format that, together, make up a representative sample of a language or sublanguage. There are many sources for ready-made corpora. The basic idea is that the corpus is "experimental" data and corpus linguists are scientists analyzing the data collected. One of the main purposes of corpora is to find collocations for a given word. A collocation is "the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance" (Concise Oxford Dictionary, 19th Ed.).
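To make "a frequency greater than chance" concrete, here is a minimal Python sketch that ranks the words appearing near a given word. It uses pointwise mutual information (PMI), which is only one of several measures corpus tools use, and it is my choice for illustration, not something the dictionary definition prescribes. The function names, the window size and the minimum count are all assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def collocates(tokens, node, window=2, min_count=2):
    """Rank words found within `window` tokens of `node` by a PMI-style score."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_freq[tokens[j]] += 1
    n = len(tokens)
    scored = []
    for word, joint in pair_freq.items():
        if joint < min_count:
            continue  # rare pairs give unreliable scores
        # Co-occurrences we would expect if the two words were unrelated
        expected = word_freq[node] * word_freq[word] * (2 * window) / n
        # Score above 0 means the pair occurs more often than chance predicts
        scored.append((word, joint, math.log2(joint / expected)))
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

Each result is a tuple of (word, number of co-occurrences, score). On a real corpus you would raise `min_count`, since very rare words tend to get inflated scores.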
TYPES OF CORPORA
Corpora can be categorized in many ways, but here I will only present the types I feel are most useful for translators. General corpora contain texts that do not belong to a single text type, subject field or register. Specific corpora can be as specific as the creator desires, containing all the texts of a given author, all the texts published in a given magazine or newspaper, or a collection of contracts from many sources. These are also referred to as sublanguage corpora.
In addition to the specification above, corpora can contain texts in only one language (Monolingual) or in two or more languages (Bilingual or Multilingual). Obviously, a bilingual or multilingual corpus would be of interest to the translator! But wait, there is an important subcategory: when a corpus contains originals and translations, it is called a Parallel Corpus, and when it contains similar original (non-translated) texts in two different languages, it is called a Comparable Corpus.
General Monolingual Corpora
General monolingual corpora are available on the Internet for many languages or can be purchased on CD from some sources (usually academic). Examples are the BNC (British National Corpus) and the Cobuild Corpora. For access to general corpora for different languages, see David Lee's Corpora site or search on the Internet for the language and the word corpus. Two websites allow corpus-type searches of documents on the Internet: WebCONC for European languages and WebCorp for English. Most countries have at least one monolingual corpus. The Hellenic National Corpus (HNC), accessible online, allows you to search on two or three terms and indicate the number of words separating each term.
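A search on two terms separated by a limited number of words can be imitated on your own texts with a few lines of Python. This is only a sketch of the idea. It says nothing about how the HNC or any other online corpus is implemented, and the function name and the `max_gap` parameter are my own.

```python
import re

def proximity_hits(text, first, second, max_gap=3):
    """Find `first` followed by `second` with at most `max_gap` words in between."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # `second` may sit at most max_gap + 1 positions after `first`
        window_end = min(len(tokens), i + max_gap + 2)
        for j in range(i + 1, window_end):
            if tokens[j] == second:
                hits.append(" ".join(tokens[i:j + 1]))
    return hits
```

Punctuation is discarded, so a hit may span a sentence boundary. For careful work you would split the text into sentences first.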
Specific Monolingual Corpora
Specific monolingual corpora are also available on the Internet, or you can create them yourself. If you translate many newspaper articles into language X, you can probably access a specific corpus with newspaper articles for that language or download articles yourself and create your own personal corpus. Wordsmith (Ref. 14) can be used for this. Wordsmith 4 even has a built-in tool to help you download texts off the Internet, or you can use a website copying tool like WinHTTrack. The type of information extracted from specific corpora is the same as for general corpora, but the results are more specific to the area covered by the corpus. For a specific corpus containing newspaper articles, words like corruption, politics, and party would be more common than in a general corpus.
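The difference between a specific and a general corpus can be seen by comparing relative word frequencies. The sketch below is a deliberately crude comparison (a simple frequency ratio, not the keyness statistics that dedicated tools compute), and all names in it are assumptions.

```python
import re
from collections import Counter

def relative_freqs(text):
    """Return ({word: share of all tokens}, total token count) for a text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}, total

def overused_words(specific_text, general_text, top=10):
    """Rank words by how much more frequent they are in the specific corpus."""
    spec, _ = relative_freqs(specific_text)
    gen, gen_total = relative_freqs(general_text)
    # Stand-in frequency for words that never occur in the general corpus
    floor = 1 / (gen_total + 1)
    ratios = {w: f / gen.get(w, floor) for w, f in spec.items()}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Words with a ratio well above 1 are the ones characteristic of the sublanguage. Function words like "the" should end up near 1.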
Why should you go through all this work to create your own corpora? I translate medical texts from Portuguese into English, mostly journal articles. Copying a few journal articles on the same topic (often found in the references of the article being translated) into text files and searching them can facilitate translation. When I first started out in translation, I looked for articles like these and read them, underlining terminology that might appear in my source text in the other language. Now, instead of reading them, I can search for certain words and read only the sentences in which they appear.
Corpora are also great for the "guess and check" method. If you are sure of one word in a 2-3 word expression, you can search on the one sure word to see which words it collocates with in the specific sublanguage of your text/corpus. For example, when the original article mentions some medical test performed, and I cannot find a translation through other means, searching for the words "test" and "tests" in a specific corpus containing only the articles in the references and looking at the context can at least give me a place to start. I can look at all the tests commonly used in this sub-area, look them up on the Internet, and see if one matches the description of the test in the original text. I could do this on the Internet, and spend all day looking. I could also read all the reference articles, but that would take a long time too.
A specific corpus is also useful for the spelling of words that could end in -ic or in -ical (like electric and electrical). Note that we write electric train, electric switch, but electrical wiring and electrical installations. Medical terminology is similarly confusing.
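Searching for a word and reading only its context is what a concordancer does. For readers who want to see the principle, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how TextSTAT, AntConc or Wordsmith work internally, and the function name and `width` parameter are assumptions.

```python
import re

def kwic(text, words, width=30):
    """Return keyword-in-context lines for any of `words` (whole words, any case)."""
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks so the context reads cleanly
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines
```

Calling `kwic(article_text, ["test", "tests"])` on the text of a reference article lists every occurrence with a few words on either side. The same function can be used to check whether "electric" or "electrical" appears before a given noun.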
Michael Wilkinson has written a wonderful article entitled "Using a Specialized Corpus to Improve Translation Quality". His Finnish translation students often translate tourism texts into English, and he is teaching them to use specific corpora containing English-language tourist brochures to create more natural-sounding texts.
Parallel Corpora
Parallel corpora are available online for many languages, but they often include only fiction (usually books and translations already in the public domain) or newspaper/magazine articles due to copyright restrictions. Sometimes you must download the files to your computer and align them yourself. The COMPARA corpus provides English/Portuguese parallel corpora which can be searched online for free.
Translation memories are a type of parallel corpus created as you work, with the only selection criterion being what you yourself have translated. Wouldn't it be great to have access to the translation memories of experienced translators in addition to your own translation memories? Many good translations are available on the Internet and can be downloaded and added to your personal corpus. By aligning them, you are creating a valuable translation memory (a.k.a. parallel corpus) based on these translations. Needless to say, you must be careful what you add to your corpus and not rely on it blindly.
Two translation environment tools were made to work with corpora: MultiTrans and LogiTerm. Your corpora can be searched easily, and can also be used as a translation memory while translating. Both these tools do alignment automatically with complex algorithms. Tools like DejaVuX and SDL Trados can be used for corpora lookups if you align your files, then feed them into a translation memory. They both come with alignment tools which do basic pre-alignment based on paragraphs and periods which you can then correct, if necessary. You might want to create a separate TM to hold these texts translated by others. Click on the names of the programs above to see screen shots and read about their capabilities.
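To give an idea of what a basic pre-alignment based on paragraphs and periods involves, here is a naive Python sketch. It is an illustration under my own assumptions (blank lines separate paragraphs, sentences end in a period, question mark or exclamation mark) and does not reproduce the alignment tools of DejaVuX, SDL Trados, MultiTrans or LogiTerm.

```python
import re

def split_sentences(paragraph):
    """Split after sentence-final punctuation (naive: abbreviations will break it)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def pre_align(source_text, target_text):
    """Pair paragraphs by position, then sentences by position within each pair."""
    src_paras = [p for p in re.split(r"\n\s*\n", source_text) if p.strip()]
    tgt_paras = [p for p in re.split(r"\n\s*\n", target_text) if p.strip()]
    pairs = []
    # zip stops at the shorter text, so unmatched trailing paragraphs are dropped
    for src_p, tgt_p in zip(src_paras, tgt_paras):
        src_s, tgt_s = split_sentences(src_p), split_sentences(tgt_p)
        if len(src_s) == len(tgt_s):
            pairs.extend(zip(src_s, tgt_s))
        else:
            # Sentence counts differ: keep the paragraphs whole for manual correction
            pairs.append((" ".join(src_s), " ".join(tgt_s)))
    return pairs
```

The fallback branch is the reason manual correction is still needed: whenever a translator has merged or split sentences, a position-based pairing cannot tell which sentence belongs with which.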
All translation memory/corpus-based translation tools have a menu option or button that allows you to look at previous occurrences of a source word in the parallel texts translated earlier (for example, Scan in DejaVuX and Concordance in SDL Trados). However, not all users of these programs know that you can also search your translation memories when you are not using the tool to translate a text (for example, when the original is not in electronic format). Wordfast also has an alignment tool, but I am not familiar with it. Note that the corpus lookup interfaces provided by MultiTrans and LogiTerm are much better than those provided by DejaVuX and SDL Trados, since they were created with corpora in mind.
Translators working with Canadian French can create an interesting legal corpus by downloading laws published in both French and English from the Canadian Department of Justice's web site. They make it easy: search for what you want in English, then click on the menu option for French and the exact same legislation is shown in French. They even provide definitions and translations of terms in the margins! The French government also has an official English translation of its Civil Code. The European Union also has many publications, including laws and patents, translated into many languages. All trade communities (like NAFTA) have some documents published in all applicable languages. Some universities have already separated out this information and compiled it. All you would need to do is download the text files and align them. The Europarl Parallel Corpus is an example: download text files in most European languages, labeled and ready for alignment.
Comparable Corpora
The objective of a comparable corpus is to compare texts without the distortions created by translation. If you are translating into language X, you collect documents similar to the document you are translating but originally written in language X. For example, if you need to translate a text on soil mechanics into English, you could search for various sites on the subject, copy the texts of a few good web pages to text files, and immediately use them as a corpus for your job (instead of searching for something in Google, search for it in your new specific corpus).
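If you prefer to script this instead of using a corpus tool, a folder of text files can be loaded and searched with a few lines of Python. This is a minimal sketch. The function names are my own, and it assumes the pages were saved as UTF-8 `.txt` files.

```python
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict of {file name: text}."""
    files = sorted(Path(folder).glob("*.txt"))
    return {p.name: p.read_text(encoding="utf-8") for p in files}

def search_corpus(corpus, term):
    """Return (file name, line) pairs for lines containing `term`, ignoring case."""
    hits = []
    for name, text in corpus.items():
        for line in text.splitlines():
            if term.lower() in line.lower():
                hits.append((name, line.strip()))
    return hits
```

For the soil mechanics job, you might save the pages in a folder (say, a hypothetical `soil_mechanics` folder) and call `search_corpus(load_corpus("soil_mechanics"), "shear")` to see every line that mentions the term, along with the file it came from.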
Another use of comparable/monolingual corpora is to discover connotations of words and their translations. This is important when translating into a foreign language or into a dialect other than your own. For example, the word "smell" in English usually carries a negative connotation unless some positive adjective modifies the word (good smell, nice smell, pleasant smell) or unless the speaker's tone of voice or expression indicates a positive feeling ("what's that smell" as speaker enters the kitchen, smiling). Note that "scent" is almost always positive unless related to animals. In contrast, in Portuguese, the word "cheiro" is positive by default, and requires a negative modifier to become negative ("cheiro ruim"). This is exactly the kind of information that can be extracted from a comparable corpus, and will be the basis for future bilingual dictionaries. It helps to choose between synonyms.
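A first, rough way to look for this kind of evidence is to count the words that appear immediately before the word in question. The Python sketch below does just that. The function name is an assumption, and a real study would also look at wider context and at the words that follow.

```python
import re
from collections import Counter

def left_neighbours(text, node):
    """Count the word immediately preceding each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    # Punctuation is ignored, so a neighbour may belong to the previous sentence
    return Counter(tokens[i - 1] for i in range(1, len(tokens)) if tokens[i] == node)
```

Running it for "smell" on an English corpus and for "cheiro" on a Portuguese one, then comparing the most common modifiers with `most_common()`, gives a quick impression of whether positive or negative modifiers dominate.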
HOW TO BUILD YOUR OWN CORPORA
You can use TextSTAT, AntConc or Wordsmith to create a monolingual (or comparable) corpus. TextSTAT is the simplest, and both TextSTAT and AntConc are free. Wordsmith has a free version, but it is more complicated to learn to use than the first two. For a parallel corpus, just align the texts within your translation tool. A tool called ParaConc exists for parallel corpora, but I do not recommend it. Click on ParaConc in the previous sentence to read why.
HOW TO USE YOUR CORPORA DURING A JOB
In summary, there are various ways to use corpora when translating. I will present them again in order of time invested:
- Access a monolingual general corpus in the target or source language available on the Internet. You can find these corpora and store them in your bookmarks for easy access.
- Access a bilingual or multilingual corpus with your working languages available on the internet, if one exists. This may only work well if the subject area of the corpus matches the subject area of your job.
- Create specific monolingual corpora on the fly as you work, with the texts as similar as possible to the job text in subject area and register.
- Create specific bilingual corpora when you are not working on a job (or when a job is long-term or recurrent) with reliable translations provided by the client, a colleague or the Internet. Be very careful what you use, since a bad reference is worse than no reference at all.
For examples of how to use corpora during translation, see my article How to Use Linguistic Corpora to Improve your Translations.