The term language resources refers to sets of language data and descriptions in machine-readable form, including written and spoken corpora, grammars and terminology databases. Language resources can be used to build, improve or evaluate natural language systems such as machine translation engines.
Why are language resources needed?
Language resources are needed to improve the quality of machine translation, both in general and in specific domains. To improve the services of the CEF Automated Translation (CEF AT) platform, the underlying automated translation systems must be trained with relevant language resources in all official languages of the 30 countries participating in the CEF programme. Large general-domain corpora, whether monolingual (e.g. official corpora of national languages) or multilingual, should be sought, as well as domain-specific language resources in fields such as consumer rights, culture, the legal domain, social security, health and public procurement. These are the domains covered by the online public services that CEF AT will support.
Significant amounts of valuable linguistic data are generated every day in all Member States and CEF-affiliated countries, including by non-governmental and private organisations. A large part of this data can be very valuable as language resources for the CEF AT platform.
The European Language Resource Coordination action is looking for open data that can be made available for re-use through open data initiatives, but also for commercially available datasets.
Language resources for the CEF AT platform
Some datasets produced by public administrations can be used directly by the automated translation system: aligned corpora from translation memories, terminology resources, lexica and dictionaries. Many other resources are published as information documents (reports, guides, flyers, records of administrative decisions, etc.), which need additional processing to be turned into language resources, and this is only possible if they come in a reusable format. For example, a scanned PDF may be unusable if its text was produced with simple OCR tools.
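To illustrate what "directly usable" can mean for a translation memory, the following is a minimal sketch that extracts aligned segment pairs from a translation memory exported in TMX, a widely used XML exchange format. The choice of TMX, the sample content and the function name `aligned_pairs` are assumptions made for illustration only; they do not describe how CEF AT ingests data.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample content, invented for this sketch.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Consumer rights</seg></tuv>
      <tuv xml:lang="fr"><seg>Droits des consommateurs</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Public health</seg></tuv>
      <tuv xml:lang="fr"><seg>Santé publique</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Social security</seg></tuv>
    </tu>
  </body>
</tmx>"""


def aligned_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for the two given languages.

    Translation units that lack either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in aligned_pairs(SAMPLE_TMX, "en", "fr"):
        print(source, "->", target)
```

A scanned document offers no such structure: segments and their translations would first have to be recognised, cleaned and aligned before pairs like these could be produced.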
Which types of data are useful for MT training?
Relevant language resources that meet the requirements of automated translation come in various types:
Translation memories: linguistic databases that capture translations made by humans; they can facilitate future translation tasks and can also be used to train automated translation systems
Translation/language models: statistical models that assign a probability to a piece of unseen text, based on some training data
Corpora: monolingual and multilingual corpora; comparable, aligned, parallel documents
Lexica: monolingual and multilingual lists of words, multi-word expressions, sentences, etc. in general or specific subject fields
Terminological resources: structured sets of concepts, with associated linguistic information in a specific subject field
Grammars: sets of rules that formalise a language
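As an illustration of the language model entry above, the following is a minimal sketch of a bigram model with add-one smoothing that assigns a log-probability to a piece of unseen text. The training sentences, function names and smoothing choice are assumptions made for this example; production translation systems rely on far larger corpora and more sophisticated models.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count single tokens and adjacent token pairs in the training data."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Score a sentence with add-one (Laplace) smoothed bigram estimates."""
    # One extra vocabulary slot is reserved for words never seen in training.
    vocab_size = len(unigrams) + 1
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for previous, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(previous, word)] + 1
        denominator = unigrams[previous] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical training data, invented for this sketch.
TRAINING = [
    "the citizen submits the form",
    "the ministry publishes the report",
]

if __name__ == "__main__":
    unigrams, bigrams = train_bigram_model(TRAINING)
    fluent = log_probability("the ministry publishes the form", unigrams, bigrams)
    scrambled = log_probability("form the publishes ministry the", unigrams, bigrams)
    print(fluent > scrambled)
```

Neither test sentence appears in the training data, yet the model scores the one whose word order resembles the training data higher than the scrambled one. This is why larger and more domain-relevant corpora tend to yield more useful models.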
Where to look for language resources?
Relevant sources of valuable language resources are public services in the EU Member States and in the countries associated with the CEF programme.
These could be public services with a mission at the national, regional, local, cross-border or cross-country (bilateral, multilateral) level, including Head of State offices, national or federal ministries, parliaments, regional governments, local authorities, etc., as well as international organisations with a European basis and mission.
These could also be public administrations responsible for e-government platforms and services in the CEF-relevant areas (e.g. consumer rights, culture, legal domain, social security, health, public procurement), publication offices of ministries, documentation centres, national libraries, etc.
Use Case – Reuse of Emergency Calls embedded in TV Shows
As part of the ELRC initiative (2015-2022), the ELRC legal helpdesk analysed the legal conditions under which audio, video and dialogue subtitles from emergency calls embedded in a German TV show can be re-used for developing AI models. The analysis responded to an inquiry submitted to the ELRC Helpdesk by a research project that focusses on building AI models for improving emergency call assistant systems. It reviewed several legal aspects specific to German legislation, including intellectual property and copyright protection, and also addressed the use and sharing of different types of data and their derivatives for research and commercial purposes.