Machine translation is an attractive interdisciplinary task that draws on computer science, linguistics, statistics and mathematical modeling. It also poses specific software engineering challenges, since the volume of data processed typically reaches tens of billions of words. At the Institute of Formal and Applied Linguistics, several prototypes of machine translation systems have been created and implemented. Two approaches have been adopted.
The first is phrase-based statistical translation, which uses machine learning over very large text corpora; the second relies on deep linguistic analysis of whole sentences. A combination of both approaches, e.g. in the form of grammar-based corrections of the phrase-based translation output, seems highly promising. In addition to text-to-text machine translation, the Institute is developing cross-language search that operates on multimodal data, as well as multilingual dialogue systems enabling natural human-computer interaction. The translation systems developed at the Institute focus especially on translation between Czech and other European languages (esp. English, German, French and Spanish); however, a phrase-based translation system can be quickly adapted to any language pair for which a sufficient amount of language data can be obtained. Further development of machine translation methods and systems is being pursued in collaboration with many academic and commercial partners within several European, U.S.-based and national projects, as well as in direct cooperation with industry.
Research in machine translation and related areas of text and speech analysis is supported by the research infrastructure LINDAT/CLARIN (http://lindat.cz), which is part of the European CLARIN ERIC network. The infrastructure collects and prepares open language resources necessary for all areas of research in computational linguistics and natural language processing. It also provides open tools and services for both fundamental and applied research in this dynamic area.
P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst: Moses: Open Source Toolkit for Statistical Machine Translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume, Proceedings of the Student Research Workshop, Proceedings of Demo and Poster Sessions, Tutorial Abstracts, pp. 177-180, 2007.
J. Hajič, E. Hajičová, J. Panevová, P. Sgall, O. Bojar, S. Cinková, E. Fučíková, M. Mikulová, P. Pajas, J. Popelka, J. Semecký, J. Šindlerová, J. Štěpánek, J. Toman, Z. Urešová, Z. Žabokrtský: Announcing Prague Czech-English Dependency Treebank 2.0. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), İstanbul, Turkey, pp. 3153-3160, 2012.
O. Bojar, R. Rosa, A. Tamchyna: Chimera – Three Heads for English-to-Czech Translation. In: Proceedings of the Eighth Workshop on Statistical Machine Translation, Sofia, Bulgaria, pp. 92-98, 2013.
F. Jurčíček, B. Thomson, S. Young: Reinforcement learning for parameter estimation in statistical spoken dialogue systems. Computer Speech and Language, 3, June 2012.
P. Pecina, O. Dušek, L. Goeuriot, J. Hajič, J. Hlaváčová, G. J. F. Jones, L. Kelly, J. Leveling, D. Mareček, M. Novák, M. Popel, R. Rosa, A. Tamchyna, Z. Urešová: Adaptation of machine translation for multilingual information retrieval in the medical domain. Artificial Intelligence in Medicine 61, pp. 165-185, Elsevier, 2014.