The creators of Perldoop. / Andrés Ruiz
The vast amount of information added to the internet every day keeps growing. It is estimated that in just 24 hours we generate approximately 2.5 quintillion bytes (2.5 exabytes), or, put another way, about 27 GB per second, the equivalent of a full season of Game of Thrones in high definition (HD). In fact, 90% of the data currently available worldwide has been created in just the last two years.
Of this enormous quantity of data (grouped under the term Big Data), only 5% can be considered structured information; the remaining 95% (which consists mainly of text) has no organization or structure of any kind, which poses a serious problem when it comes to accessing and managing all the available information.
The tool adapts text and document processing applications to computing models
Now a team of researchers at the Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS), made up of experts in High Performance Computing (HPC) and Natural Language Processing (NLP) at the University of Santiago de Compostela, has developed a tool that automatically adapts applications used in text and document processing to computing models (specifically, parallel computing compatible with multicore or many-node clusters), which will significantly reduce execution times and make it possible to work with far larger volumes of data than those handled today.
The results will make data analysis simpler and more efficient. Their proposal is based on the design of a new system that transforms software used for natural language processing (NLP, typically written in the Perl programming language and executed sequentially) into a solution compatible with Big Data technologies.
By simply inserting a few tags into the original application, this translation tool lets the programmer automatically convert all of their Perl code into Java code adapted to the MapReduce paradigm (a programming model used by Google to support parallel computing over large data collections), thereby enabling it to run on a cluster, that is, to execute simultaneously on multiple cores or computing nodes.
In this way, computing speed is multiplied by a factor proportional to the number of available processors (for example, with 1,000 processors the resulting code will, in the ideal case, be approximately a thousand times faster than the sequential solution).
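To make the MapReduce idea concrete, here is a minimal sketch of a word count split into map, shuffle and reduce phases. It is an illustration in Python only: it is not Perldoop's output (which the article describes as Java), and every function name here (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`, `ideal_parallel_time`) is hypothetical. The phases run in a single process; on a cluster the map and reduce calls would be distributed across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every token in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in one process (no real distribution happens here)."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


def ideal_parallel_time(sequential_seconds, processors):
    """Ideal-case run time, assuming the work divides perfectly with no overhead."""
    return sequential_seconds / processors
```

Because each `map_phase` call depends only on its own line, and each `reduce_phase` call only on its own key, the calls can run independently, which is what makes the ideal-case speedup proportional to the number of processors. Real clusters fall short of that ideal because of communication and scheduling overhead.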
Perldoop, an open-source tool
Another feature of this research, which has produced the Perldoop translation tool, is that the results have been released under a free software license, so that the tool is available to as many users and professionals as possible. As Professor Juan Carlos Pichel, principal investigator and head of the project, explains, the decision was taken because “the development of Big Data solutions for NLP is, at present, only within reach of the most powerful companies”. With the proposed solution, and minimal programming knowledge, it will be possible to convert any code written in Perl into a solution capable of running on a computing cluster.
Among the main advantages of this new solution, its versatility stands out in particular, since it is a general-purpose tool; as a result, applications in fields as diverse as machine translation, the analysis of information in blogs, or even the processing of genetic data will be able to benefit.
Source: CiTIUS – University of Santiago de Compostela
V International Conference on Corpus Linguistics. We will be attending the 5th International Conference on Corpus Linguistics, 14-16 March 2013.
We are proud to announce brat v1.3 (Crunchy Frog), an open-source (MIT),
browser-based tool for text annotation. The tool is available from:
And examples of past and potential usages presented at:
brat is a general-purpose tool for the visualisation and creation of
annotations for tasks such as entity mention annotation, relation and event
annotation, dependency syntactic annotation, and others. The new version
introduces many new features to the tool, including:
-Entity normalisation / linking / grounding support:
-Supporting embedded visualisations for web pages and web-based applications: http://brat.nlplab.org/embed.html
-Discontinuous text annotations
-In-built annotation tutorials and additional example corpora
-New annotation comparison functionality
-A fast, easy-to-use standalone server (experimental)
For details, please see:
If you want to upgrade an existing installation, please see:
brat is developed as a collaborative effort between several research groups as
an open source and open development project, and we warmly welcome
contributions and participation from the community. Possible contributions are
not limited to code, but also include feature requests, bug reports and more.
Since its initial release, brat has been adopted for use in corpus
annotation and other tasks by various groups. We hope this new version
will be as well received as the initial release of the tool, and will
gladly answer questions and welcome any feedback.
The new normalisation features were presented in:
Stenetorp et al. Normalisation with the brat rapid annotation tool
In proceedings of SMBM'12
A recent front page article in this newspaper suggested that Maltese, together with a number of other “small” European languages, risks being left out in the cold in the digital age (Maltese At Risk Of Digital Extinction, October 1). The immediate motivation for the article was a report published under the auspices of Metanet, a Europe-wide network of research centres involved in the development of language technology and resources, of which the University of Malta’s Department of Intelligent Computer Systems and Institute of Linguistics form part.
The digital extinction of Maltese is being addressed by ongoing developments both within academia and industry
- Albert Gatt
The report adopted the term digital extinction to describe the risk faced by languages which do not have adequate support in various areas of language technology.
The term has a satisfyingly ominous ring to it, one that was no doubt designed for the pages of the popular press.
Nevertheless, the point made by the report is well-taken. Broadly speaking, it is this: while some languages – notably English – appear to have a comfortable existence in the digital/computational world, as indicated both by their frequency of use in the electronic media and by the development of intelligent, language-sensitive technology for these languages, others like Maltese are far less well represented and are therefore a cause for concern.
There are two important principles that implicitly underlie this report.
The first is that multilinguality should be safeguarded as an outward manifestation of cultural and social diversity, with technology functioning as a bridge to effective communication.
The second is that all languages should be equal, that is, all speakers should be able to avail themselves of technology to facilitate communication, no matter how small the linguistic community they hail from.
Thu Oct 11 2012
Diss: Comp Ling/ Lexicography/ Semantics/ Text/Corpus Ling/ English: 'Can You Really Know a Word by the Company It Keeps?...'
Editor for this issue: Lili Xia <lxialinguistlist.org>
From: Nikola Dobric <Nikola.Dobricuni-klu.ac.at>
Subject: Can You Really Know a Word by the Company It Keeps? An Investigation into the Contextual Influence on Aspects of Polysemy
Institution: Universitat Klagenfurt
Program: L 792 343 - Dr.-Studium der Philosophie Anglistik und Amerikanistik
Dissertation Status: Completed
Degree Date: 2012
Author: Nikola Dobric
Dissertation Title: Can You Really Know a Word by the Company It Keeps? An Investigation into the Contextual Influence on Aspects of Polysemy
Linguistic Field(s): Computational Linguistics
Subject Language(s): English (eng)
Allan Richard James
One of the most pressing issues in lexical semantics is surely the lack of
solid empirical criteria in accounting for sense distinction. The fact that to
date the only viable mode of word sense disambiguation has been based
on the researcher's own judgment implies that clearly defining the
boundaries of different interpretations of a polysemous lexeme and
expressing such a statement in empirical (linguistic) criteria is practically
impossible. The methodology explored within the thesis promises a fully
criteria-based account of word senses based on the use of representative
language corpora. The paper aims to test this claim, raised once again by
the recently re-emerging corpus-based decompositional approaches to
word sense disambiguation (WSD), prototypicality of senses, and sense
networks. Through the application of one of the most recent versions of
this methodology, namely Behavioral Profiling, to the polysemous verb
look, the paper will try to show how reliable the methodology is in its
promise of an objective and purely linguistic account for word senses.
A team at Swansea University has developed an online tool that allows researchers to compare multiple translations of Shakespeare at the same time to see how much they vary.
The platform can be used for any text, but has been demonstrated with Act 1, Scene 3 of Shakespeare's Othello, where the eponymous hero gives a persuasive speech about his courtship of Desdemona to her disapproving father Brabantio and others. Users can compare the original base English text (Michael Neill's OUP edition) with any one of 37 different German editions, dating from 1766 to 2010 -- something that the team calls a "translation array".
The most intriguing tool is the "Eddy value" tool which allows you to select individual lines from the scene and compare them to the translations from the 37 different German editions (the aim is to add in further languages and additional scenes, but the project needs more funds to do this). Based on analysis of all of the translations, each of the individual line translations has been awarded a numerical value. The higher the Eddy value, the more distinctive the translation, i.e. the more it stands out from the crowd. The value is calculated using word frequencies in the whole set of translations.
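The article does not give the formula behind the Eddy value, only that it is derived from word frequencies across the whole set of translations. The sketch below is therefore not the project's actual calculation; it is one simple, assumed way to turn word frequencies into a distinctiveness score, in which a translation scores higher when its words appear in fewer of the other versions. The function name `distinctiveness_scores` and the whitespace tokenization are illustrative assumptions.

```python
from collections import Counter


def distinctiveness_scores(translations):
    """Score each translation by how rare its words are across the whole set.

    Assumed scoring rule (not the published Eddy formula): the mean, over the
    words of a translation, of 1 minus the share of translations containing
    that word.
    """
    tokenized = [text.lower().split() for text in translations]
    # Document frequency: in how many translations does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    total = len(tokenized)
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)  # an empty translation has no words to score
            continue
        rarity = [1 - doc_freq[word] / total for word in tokens]
        scores.append(sum(rarity) / len(tokens))
    return scores
```

Under this rule a translation that shares all its wording with every other version scores 0, and one built entirely from words no other version uses approaches 1, which matches the article's description of a higher value meaning a translation that stands out from the crowd.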
Linguist Tom Cheesman, who heads up the project at Swansea University, explains: "If you say 'deviation from a norm', it is misleading, conceptually and statistically. Translation doesn't work like that: people think there's a 'right version' and then various kinds of mistake. No: it's about differing interpretations, not about right and wrong."
RESEARCHERS from Swansea University have developed a new computer system for looking at translations of texts.
The online tool will allow people to look at differences in texts — for example in translations of Shakespeare in different languages around the world — and study the differences between them, and why they differ.
The project, funded by the Arts and Humanities Research Council, brings together experts in languages, computing science, English and design.
I often pass trucks like the one pictured below in my travels to and from northern Michigan (this one happened to be stopped at a gas station where I was filling up which allowed me to snap the picture).
I think it is wonderful that the company assembles 100% of the toys in their product line in the United States and that all of the plastic used in the toys is purchased in the USA (see here).
Unfortunately, every time I see these trucks I can't help but think "another load of crap."
The results from a Google Ngram Viewer search comparing the use of the phrases "load of crap" and "load of toys" in American English help to explain why.
CALL FOR PAPERS
Panel on "Corpus-based translation studies"
7th EST CONGRESS, 29 to 31 August 2013, University of Mainz in Germersheim
Panel organizers: Claudio Fantinuoli and Federico Zanettin
While corpus-based research critically depends on the availability of suitable
tools and resources, there is still a lack of user-friendly tools allowing
researchers in the soft sciences to create and analyze corpora according to the
standards of the discipline.
This panel aims to provide a framework for discussing corpus data, tools and
approaches which may allow translation scholars to collaborate among themselves and
with the NLP community, in order to improve the quality of resources and make
them available and accessible, with the ultimate goal of bridging the gap
between the hard and soft sides of this multi-faceted field.
Contributions related, but not limited, to the following topics are welcome:
· NLP-oriented perspectives and methods for T&I research
· Corpus-based methodologies and T&I studies
· Annotation models for descriptive translation studies
· Translation and corpus design
· Qualitative and quantitative approaches to corpus analysis in T&I studies
· Corpus-based translation studies and minority languages
· Accessibility issues: copyright and data distribution
· Corpus compilation tools for T&I studies
· Metadata for descriptive translation research
· Methods and techniques for data collection
· Corpus-based analysis of translation shifts
· Parallel corpora in T&I studies
· Alignment of parallel corpora
· Usability of software for corpus building and analysis
· Spoken corpora and alignment of transcriptions and audio/video recordings
Researchers are invited to submit their paper proposals until 1 November 2012
using the Congress Web service. More information about the congress, panels and
venue is available at: http://www.fb06.uni-mainz.de/est/index.php
Using corpora in translation studies
In recent decades corpus linguistics has gained wide acceptance in lexicography and in other fields of applied linguistics. But only recently have corpora (large machine-readable text collections) been explored as a useful approach in translation studies. Although dictionary data still constitutes the primary reference source for professional translators and translation students, corpus evidence better supports text production both in the first language and in a second language, for instance by providing contextual variants of collocations with fine-grained meaning distinctions.
My presentation aims to show the main advantages (but also the limits) of using corpora in translation studies with the help of a few examples taken from class lectures and homework done by current Italian and German MA students in Heidelberg. General language and specialised language collocations (word combinations) as well as culture-specific words (realia) will be used for this purpose.
Dr Laura Giacomini is a visiting fellow at the ANU Centre for European Studies and a lecturer and researcher of the Department of Translation and Interpretation at the University of Heidelberg. She also works as a sworn translator and lexicographer. In 2011 she completed her PhD in the field of applied linguistics, developing a new approach for the treatment of collocations in electronic dictionaries. She is presently carrying out further corpus-based research on collocations and other phraseological units, both from a lexicographic and a translation perspective.
RSVP: email@example.com by Wednesday 5 September 2012.
ANUCES is an initiative involving four ANU Colleges (Arts and Social Sciences, Law, Business and Economics, and Asia and the Pacific) co-funded by the ANU and the European Union.
What is corpus linguistics?
Not a branch of linguistics, like socio~,
Not a theory of linguistics
A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
London is about to experience Olympic fever again with the Opening Ceremony of the Paralympic Games taking place tonight. Already disabled athletes have started appearing in the city and interacting with locals and other visitors.
The Paralympics provide a great occasion to focus attention on the issues and difficulties faced by disabled people across the world. The BBC reported earlier today that:
“if Chinese athletes perform as well in the Paralympic Games [as China did in the Olympic Games] it could help change attitudes towards disabled people in China. The Beijing Paralympic Games in 2008 played a huge part in changing attitudes, but campaigners say China still has a lot to do”.
Locally, the Head of Scope Cymru has made a similar point in the context of a survey showing attitudes to disabled people are worsening in Wales.
Those of us interested in endangered languages might think of sign languages and the Deaf community (since all sign languages are endangered and subject to pressure from speakers of majority spoken languages). However, as UK Deaf Sport reminds us: “many Deaf people do not consider themselves disabled, particularly in physical or intellectual ability. Rather, we consider ourselves to be part of a cultural and linguistic minority”. There is in fact a separate Deaflympics, “the second oldest multi-sport and cultural festival in the world, with a proud history stretching back to the first Games in Paris, in 1924” and sanctioned by the International Olympic Committee. It was recently announced by Craig Crowley, President of the International Committee of Sports for the Deaf, that the next Summer Deaflympics will be held in Sofia, Bulgaria in 2013 (following the cancellation of plans for Athens).
The visibility (no pun intended) of sign languages among linguists, and the wider community, has been slowly increasing in recent years; however, as with other minorities and the disabled, there is still some way to go. For example, the list of DoBeS projects of the Volkswagen Foundation does not include any sign languages at all, despite the information for applicants [.pdf] stating that “documentation projects may focus on endangered dialects, moribund languages as well as sign languages”. The Endangered Languages Documentation Programme at SOAS has so far funded eight projects on sign languages, namely:
Australian sign language by Trevor Johnston, Macquarie University
Côte d’Ivoire sign language by Tano Angoua Jean-Jacques, University of Cocody at Abidjan
a village sign language of India by Sibaji Panda, University of Central Lancashire
Malian sign language by Victoria Nyst, Leiden University
a village sign language in Bali by Connie de Vos, International Institute for Sign Languages and Deaf Studies
Mardin sign language of Turkey by Ulrike Zeshan, University of Central Lancashire
Mexican sign language by Claire Ramsey, University of California San Diego
Inuit sign language by Joke Schuit, University of Amsterdam
Corpora for several of these are available in the Endangered Languages Archive at SOAS, namely Auslan, Malian sign, Indian village sign, and Inuit sign.
Short Stories in French: New Penguin Parallel Text (French Edition)
by Ed: Richard Coward | Literature & Fiction
Registered by melissahuxley of Abu Dhabi, Abu Dhabi United Arab Emirates on Friday, August 17, 2012
Principles of corpus linguistics and their application to translation studies research
Gabriela Saldanha
Centre for English Language Studies, University of Birmingham
Corpora have been put to many different uses in fields as varied as natural language processing, critical discourse analysis and applied linguistics, to mention just a few. As is to be expected, within each of those areas corpora fulfil different roles, from providing data to build statistical machine translation systems to revealing ideological stance in politically-sensitive texts. ‘Corpus linguistics’ is understood here in a more restricted sense, linked to British traditions of text analysis that see linguistics as a social science and language as a means of social interaction where meaning is inextricably linked to the cultural and historical context in which it is produced. This article focuses specifically on the principles of corpus linguistics as a research methodology, and looks at the implications of this specific approach to the study of language in translation studies.
2. A corpus defined in corpus linguistics terms
Because there is no unanimous agreement on the necessary and sufficient conditions for a collection of texts to be a corpus, the term ‘corpus’ can be seen in the literature referring sometimes to a couple of short stories stored in electronic form and sometimes to the whole world wide web. In order to discuss the fundamental principles of corpus linguistics, it is important to first establish certain limits around what can and cannot be considered a ‘corpus-based’ study of translation.
Different definitions of corpus emphasise different aspects of this resource. The definition offered by McEnery and Wilson (1996: 87), for example, emphasises representativeness: “a body of text which is carefully sampled to be maximally representative of a language or language variety”. The problem with making representativeness the defining characteristic of a corpus is that it is very difficult to evaluate and it will always depend on what the corpus is used for. A way around this problem is found in the definition offered by Bowker and Pearson (2002: 9): “a large collection of authentic texts that have been gathered in electronic form according to a specific set of criteria”. Bowker and Pearson’s definition is more flexible than McEnery and Wilson’s, even if the assumption is still that the corpus is intended to be “used as a representative sample of a particular language or subset of that language” (Bowker and Pearson, 2002: 9). However, in making selection criteria and not representativeness the defining characteristic, Bowker and Pearson allow for a certain flexibility that reflects more accurately the fact that corpus representativeness is always dependent on the purpose for which the corpus is used and on the specific linguistic features under study.
For example, a corpus that represents accurately the distribution of a common feature – say, pronouns – in a certain language subset may not represent accurately a rarer feature, such as the use of reported speech, in the same subset. Generally, corpora are intended to be long-term resources and to be used for a variety of studies, so representativeness cannot be ensured at the design stage.
Calls: Discourse Analysis, Semantics, Text/Corpus Linguistics/China
Editor for this issue: Alison Zaharee <alisonlinguistlist.org>
From: Le Cheng <chengle163hotmail.com>
Subject: 3rd International Conference on Law, Language and Discourse
Full Title: 3rd International Conference on Law, Language and Discourse
Short Title: LLD3
Date: 03-Jun-2013 - 06-Jun-2013
Location: Shanghai, China
Contact Person: Le Cheng
Linguistic Field(s): Discourse Analysis; Semantics; Text/Corpus Linguistics
Call Deadline: 31-Jan-2013
3rd International Conference on Law, Language and Discourse
Legal Discourse: Forms and Functions
Shanghai, 3-6 June 2013
Shanghai Jiao Tong University
Organizer: School of Foreign Languages, Shanghai Jiao Tong University
Co-organizer: Multicultural Association of Law and Language
Conference Chair: Zhen-hua Wang, Shanghai Jiao Tong University, China
Convener: Le Cheng, City University of Hong Kong, HK
bab.la dictionary: Search the free online dictionary for millions of translations in many different languages.
We offer translations in many different languages, ranging from colloquial and regional expressions to more technical and specialised vocabulary. Special features include search filters, synonyms, pronunciation, example sentences and much more. Choose your preferred dictionary from the list below. Help us improve our dictionaries by suggesting new translations in the field at the bottom of the page.
Today's Topics: What is a Bilingual Dictionary?; The Steps to Making a Bilingual Dictionary; Parallel Corpora and Comparable Corpora; Equivalents in Bilingual Dictionaries; Structure of Bilingual Dictionaries.
A list of the 10 000 most used French words, according to Belgian written sources. The list has been 'cleaned up' by removing some red links for words that clearly do not meet WT:CFI. However, if you disagree, you are free to add back these links and/or start the articles in French. These modifications are listed on the article's talk page.
The word frequency ranks were calculated by running the word list from the wordnet dictionary database against a few popular search engines between 2002 and 2003. In effect, this uses the search engines' index databases as the corpus. The size of the corpus ranges from 1 billion to 4 billion.
A link to our online wordnet directory is provided for words which have a frequency rank above 2,000.
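As a rough illustration of how frequency ranks can be derived once per-word counts are available, the sketch below sorts words by count and assigns rank 1 to the most frequent. The counts in the usage example are invented for illustration, and the tie-breaking rule (alphabetical order) is an assumption; the source does not say how its counts were collected from the search engines or how ties were handled.

```python
def frequency_ranks(counts):
    """Map each word to its rank, where rank 1 is the most frequent word.

    Ties are broken alphabetically so the result is deterministic; this
    tie-breaking rule is an assumption, not something the source specifies.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _count) in enumerate(ordered, start=1)}


# Hypothetical counts, for illustration only.
example_counts = {"the": 100, "corpus": 5, "of": 60}
example_ranks = frequency_ranks(example_counts)
```

With these example counts, "the" receives rank 1, "of" rank 2 and "corpus" rank 3.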
Review: Historical Linguistics; Sociolinguistics; Text/Corpus Linguistics; Spanish: Garcia Godoy (2012)
EDITOR: García-Godoy, María Teresa
TITLE: El español del siglo XVIII
SUBTITLE: Cambios diacrónicos en el primer español moderno
SERIES TITLE: Fondo hispánico de lingüística y filología. Vol. 10
PUBLISHER: Peter Lang
André Zampaulo, Department of Spanish and Portuguese, The Ohio State University
The edited volume “El español del siglo XVIII” (‘18th-century Spanish’) is a
collection of studies dedicated to diachronic change in the first stage of
Modern Spanish. Following the editor’s introduction, the book features ten
chapters organized as four parts: ‘Periodización’ (‘Periodization’), ‘Léxico’
(‘Lexicon’), ‘Morfosintaxis’ (‘Morphosyntax’) and ‘Variedades diatópicas’
(‘Diatopic varieties’).
In her introductory chapter, editor María Teresa García-Godoy reflects on the
importance of the 18th century to the history of Spanish. After major linguistic
changes documented in the 16th and 17th centuries (e.g. the devoicing and
dissimilation of medieval Spanish sibilants), the 1700s have been traditionally
viewed as a flavorless period in the diachrony of Spanish (Lapesa 1981:
400-401). External factors such as the foundation of the ‘Real Academia
Española’ (‘Spanish Royal Academy’) in 1713 and the publication of prescriptive
documents such as the ‘Gramática de la lengua castellana’ (‘Castilian Language
Grammar’) in 1771 contributed to the standardization of Spanish in this century,
overshadowing relevant linguistic changes. As their ultimate goal, the papers in
the current volume shed light upon these changes by revealing and analyzing new
sets of data from both Peninsular and Hispanic American varieties and opening up
a relatively unexplored field of research within Spanish historical linguistics.
Part I features a chapter on the periodization of the history of Spanish and the
general contribution from research on 18th-century texts.
Gatineau, Quebec (PRWEB) July 26, 2012 MultiCorpora, an international provider of industry-leading translation technology solutions, has announced that its most recent release of MultiTrans Prism has been evaluated on TMS Live by Common Sense...
Over 5.2-million books have been analyzed and, perhaps not too surprisingly, the word...