Cautioning on the perils of machine translation, especially when expectations surpass achievable results, Michael Bauer on his Dear Developer blog posts a letter he has written to the Conservative MSP Murdo Fraser following the latter’s suggestion that Scottish (Scottish Gaelic) be added to the list of languages available on Google Translate:
“I’m sure that this is a well-intentioned idea but in my professional opinion, it would have terrible consequences. As one of the few people who work entirely in the field of Gaelic IT, I have a keen interest in technology and the potential benefit – and damage – this offers to languages like Gaelic. As it happens, I also was the Gaelic localizer (i.e. translator) for Google when it was still running the Google In Your Language programme and I have watched (often with dismay) what Google has done in this area since. One of the projects that certainly caught my eye was Google Translate, especially when Irish was added as a language in 2009. But having spoken to Irish people working in this field and having watched the effects of it on the Irish language, I rapidly came to the conclusion that while it looks ‘cool’, being on a machine translation system for a small(er) language was not necessarily a benefit and in some cases, a tragedy.
Without going into too much technical detail, machine translation of the kind that Google does works best with the following ingredients:
– a massive (billions of words) aligned bilingual corpus
– translation between structurally similar languages or
– translation from a grammatically complex language into a less grammatically complex language but not the other way round
– translation of short, non-colloquial phrases and sentences but not complex, colloquial or literary structures
In essence, machine translation trains an algorithm in ‘patterns’, which is why massive amounts of data are needed and why it works better from a complex language into a less complex language.
Unfortunately for Irish, none of these conditions were met – and would also not be met for Scottish Gaelic. To begin with, even if we digitized all the works ever produced which exist in English and Gaelic, the corpus would still be tiny by comparison to the German/English corpus for example.
Then there is the issue of linguistic distance: Irish/Gaelic and English are structurally very different, with Gaelic/Irish having a lot more in the way of complex grammatical structures than English.
Whatever the intentions of the developers, people will misuse such a system. I have put together a few annotated photos which illustrate the scale of the disaster in Ireland here. From school reports to official government websites, there are few places where students, individuals or officials trying to cut corners have not used Irish translations of Google Translate in ways they were not intended to be used.
I think we can all agree that the last thing Gaelic needs is masses of poor quality translations floating around the internet. Funding is extremely short these days and this would, in my view, be a poor use of these scarce funds. There are more pressing battles to be fought in the field of Gaelic and IT, such as the refusal by the 3rd party suppliers of IT services to Gaelic schools and units to provide (existing) Gaelic software or even a keyboard setting in any school that allows students to easily input accented characters, be that for Gaelic, Spanish or French.”
All good points, though you should read the full letter here. To these objections has been added the voice of John Storey, who has worked with the Gaelic Books Council and knows a thing or two about this issue:
“Without sounding too dramatic, may I suggest that, far from being positive, the potential exists for Google Translate to have a very negative impact on the future well-being of Gaelic. Many advocates of Google Translate forget that the current Google set-up favours English and other majority languages. Many of us argue that Google Translate actually exacerbates the dominance of majority languages such as English. Having the ability to switch from a minoritized language (in this case, Scottish Gaelic) to a majority language, English, can potentially encourage some interest and appreciation of the target language, but more often than not this will only encourage learning at a very basic level.
Another key area of concern is that of translation itself: standards of translation, as well as the well-being of the fragile but vibrant Gaelic translation sector in Scotland. I have worked in the Gaelic publishing industry for a number of years now and have had regular dealings with a variety of translators throughout the country.
No doubt Google Translate’s technology, software and performance will improve in future years, but you will never match the quality, subtlety and sensitivity available through a skilled human translator.
The danger is that many non-Gaelic individuals and groups – including companies who are asked to translate into Scottish Gaelic (perhaps, for example, as a result of the Gaelic Language (Scotland) Act) – will use Google Translate as the ‘easy option’. [ASF: As we know in Ireland from some recent scandals which caused considerable public outrage]
I would, with the greatest respect, ask you to reconsider. Technology is vital to the future of minoritized languages such as Gaelic. However, Google Translate and Gaelic should not be a priority at the moment: there are many more pressing needs with regard to our language and technology. For example, the national Gaelic development body, Bòrd na Gàidhlig, could be supported to address the shortage of technology and computer teaching staff in Gaelic medium education, particularly at Secondary School level; they could be allowed to encourage increased investment in Gaelic gaming and the software industry; they could be looking at practical measures to increase usage with regard to mobile technology and Gaelic; and so forth. There are a host of other technological needs.”
Personally I have to agree that standardised “language localisations” of software and technology are more deserving of funding and development by governments in Ireland and Scotland than online translation programs of dubious quality or worth. Irish-speakers and Scottish-speakers already exist. What they require are services and goods in their own languages, something that will of course also benefit new speakers of the Gaelic dialects. Establishing a consistent linguistic milieu across several technological platforms, from PCs to smartphones, TVs to Bluray players, Windows to OS X, Facebook to Snapchat, should be the primary objective in this area if state support is to be involved.
I believe lots of people have a decent sprinkling of Irish but are intimidated from using it for fear of ridicule from the purists (I don’t use that as a derogatory term). I understand that the language must be maintained in some fashion. However, for those with school Irish who would like to introduce a bit of it to their everyday life, reference sites like Google Translate may be useful. Maybe not accurate but useful. What is written Irish anyway? And who recorded it? As far as I am aware we were all illiterate excepting a few monks who mostly spoke Latin.
I have to agree with your conclusion that software localisation is more important, with the resulting offering of digital products to speakers and learners. Fundamental, though, would be the establishment of a dedicated technology department or branch to work on corpus compiling, text-to-speech technology and other important background work which will be essential for future language technology. In Wales, both Bangor and Aberystwyth Universities have been doing fantastic work in laying the groundwork for the creation of Welsh language technologies. DCU and Trinity are doing similar work in Ireland. Scotland needs to start compiling corpora, online grammars etc. as well as its own translation algorithms if it is not to be left behind in the coming world of speech interface technologies.
No sane person would use Google Translate for anything that will be displayed for other people to see.
I’ve used it to translate Latvian to English and vice versa for fun and it occasionally spits out hilarious bullshit.
It’s useful if you want to get a rough translation of a foreign language text and nothing more.
It’s even worse if you use it to translate from one non-English language to another, because it translates from the 1st language to English and then from English to the 2nd language.
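To see why routing through English can lose information, here is a minimal sketch using toy lookup tables. The tables and function names are invented for illustration, and nothing here reflects how Google Translate is actually built; it only assumes, as the comment above says, that the two steps are chained through English. Spanish distinguishes informal "tú" from formal "usted" and French distinguishes "tu" from "vous", but English has only "you", so the distinction cannot survive the middle step.

```python
# Toy lookup tables (assumptions for illustration only). Real systems use
# statistical or neural models, not word-for-word dictionaries.
ES_TO_EN = {"tú": "you", "usted": "you"}    # both forms collapse to "you"
EN_TO_FR = {"you": "tu"}                    # the second step has to pick one form
ES_TO_FR = {"tú": "tu", "usted": "vous"}    # hypothetical direct table


def pivot_translate(word: str) -> str:
    """Translate Spanish to French by going through English."""
    return EN_TO_FR[ES_TO_EN[word]]


def direct_translate(word: str) -> str:
    """Translate Spanish to French without an intermediate language."""
    return ES_TO_FR[word]


if __name__ == "__main__":
    for word in ES_TO_FR:
        print(word, "| direct:", direct_translate(word),
              "| via English:", pivot_translate(word))
```

Whichever form the second table picks for "you", one of the two Spanish words ends up mistranslated, because the information needed to choose was discarded in the English step.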
Jānis, I agree wholeheartedly with you on this point. There must be something wrong with me 🙂
A “high Gaelic” similar to a “high German,” would make things considerably easier for the purpose of speaking across the provinces, writing, translation, teaching, and learning for novices. Steps by courses like Gaeilge gan Stró and others are beginning to be more holistic in their approach, but our language is still greatly decentralized in terms of expression and writing. Unless you really speak it, much is lost between Ulster and Munster. I know, there are many opposed (I used to be one of them), as it does take away much of the character, but a unilateral approach to a more common denominator would give things a boost.
You should also make the writing system 100% phonetic while you’re at it. (like Latvian or Spanish, for example)
Err, Latvian written ‘o’ has at least three different pronunciations, and then there are those ‘a’s that sound like ‘e’s, and that’s just in the ‘standard’ language, which probably no one apart from school teachers and newsreaders actually speaks (assuming Latvia is no different from most nations). And OTOH Gaelic (Scottish and Irish) spelling is ‘phonetic’, you just have to understand the system. It’s the underlying phonology (system of sounds) that’s complex, more so in Scots than Irish. But you can’t do anything about that without ‘dumbing down’ the language. The only alternative to the present system would be words festooned with diacritics, and given the problems already encountered with a single length mark (the fada) in an anglophone environment, that would be a non-starter. MB referred to in the original article has useful books and web pages explaining the SG system in detail. Se duine gasta a th’ann.
If your writing system makes sense – people can learn your language more easily – and that’s especially important for dying languages like Irish which IMO should be made as accessible as possible.
I’m currently studying both Spanish and French and I find learning Spanish much more enjoyable because their writing system makes sense and is similar to the Latvian one while the French one is complete bullshit and I’m making mistakes all the time.
Yes, the French tried to make their spelling look more like Latin at some stage in their history, and added in lots of letters (not always correctly) that hadn’t been pronounced for centuries. However you can generally get from the written word to the pronunciation without error, the problem is going the other way. Even native French people have problems. But English is much worse, completely irrational both ways. English really works like Chinese, each meaning has its own graphical representation (usually) but that’s only a very rough guide to the sound. I’ve a fairly logical mind, and was even more like that as a child, and English spelling drove me crazy because I detested rote learning, and was always being made to feel I was an idiot. English spelling is *totally insane!!!*
But you appear to have completely mastered English yourself, as a foreigner. And it’s rare indeed to meet anyone from Europe these days, anyone from Northern Europe at least, who doesn’t have very good English. (Which is why English speakers don’t usually take much interest in other languages).
Manx btw is a form of Gaelic whose spelling is based on English, and equally insane. I can read it a bit but only because I learned Scots Gaelic, which is very similar, first. At least Scots and Irish Gaelic have systems that are logical if complex; English, Manx, French … are all insane systems designed to make kids feel stupid.
What worries me about ‘brute force’ corpus-based machine translation is that the input material is often harvested from the internet. And the results of people using the system generally go online these days, so isn’t there a danger of the system feeding on itself leading to increasing degradation of the results?
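The worry about a system feeding on its own output can be put into a rough back-of-the-envelope model. The sketch below is only that: every number, and the `fidelity` parameter in particular, is an assumption made up for illustration, not a measurement of any real translation system. It assumes human translations score 1.0 on an abstract quality scale, that the system's output is always somewhat worse than the average quality of the corpus it learned from, and that each round's output is published and later harvested back into the corpus.

```python
def simulate_feedback(rounds: int,
                      human_docs: float = 1000.0,
                      machine_docs_per_round: float = 500.0,
                      fidelity: float = 0.8) -> list[float]:
    """Return the average corpus quality after each round.

    Quality is an abstract score between 0 and 1. All defaults are
    illustrative assumptions, not figures from any real system.
    """
    total_docs = human_docs
    total_quality = human_docs * 1.0  # human translations are scored 1.0
    history = []
    for _ in range(rounds):
        average = total_quality / total_docs
        # Assumption: output is a fixed fraction of the training data's quality.
        output_quality = fidelity * average
        # Machine output goes online and is harvested into the corpus.
        total_docs += machine_docs_per_round
        total_quality += machine_docs_per_round * output_quality
        history.append(total_quality / total_docs)
    return history


if __name__ == "__main__":
    for round_number, quality in enumerate(simulate_feedback(5), start=1):
        print(f"round {round_number}: average quality {quality:.3f}")
```

Under these assumptions the average can only fall, since every batch added is worse than what was already there. For a small language the effect would be sharper, because the human-written starting corpus (`human_docs`) is small relative to the volume of machine output.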