Rapid advances in machine learning technology allow the building of large language models (LLMs) that need to be trained on massive datasets. These models have given rise to ChatGPT and other virtual chat assistants (or artificial “intelligence” chatbots). These rapidly spreading bots rely on models designed primarily for parsing and generating English text. The design and development of digital language technologies in general, but especially of technologies relying on LLMs, calls for a deep power analysis of who is building this technology, who will benefit from it and who will decide its future. Timnit Gebru, an Ethiopian computer scientist specialising in algorithmic bias, argues that the construction of LLMs is advancing with little evaluation of the ethical risks and without any strategy for the elimination of biases embedded in the datasets – biases that perpetuate racism and other forms of violence and discrimination that disproportionately affect marginalised communities.
Processes of colonisation, mass genocide and extractivism shaped the language world map we know today. “Minority” languages and languages in danger of dying (over 40% of all languages) were once regionally dominant languages used by millions of people from Indigenous nations across the Global Majority World: Asia, Africa, the Americas, the Caribbean and the Pacific Islands. [1] Most of these were non-textual forms of language, i.e. oral, gestural, visual, or even transmitted through sound (e.g. with drums). Today, of over 7,000 languages worldwide, only about 4,000 have written systems or scripts, most of which were developed through colonisation rather than by the languages' speakers. Furthermore, languages that have a written form but are not English or other Euro-colonial languages [2] remain at the margins of academia, the publishing industry and public online knowledge production at large. Some 60% of content online is available only in English, and most well-established academic journals in science or social science are in English.
The significant exclusion of a great proportion of the world's languages – and their many forms and modalities – from digital language technologies negatively impacts the data available for natural language processing (NLP). This in turn affects the large linguistic datasets used to train LLMs. Improving these datasets is a time-intensive and emotionally demanding task usually performed by people in Global Majority World countries, where labour is cheap and labour regulations are lax. But even when certain improvements are introduced, they are aimed at datasets in dominant languages, usually English. Thus, native languages are doubly or triply excluded from the NLP system: they generally do not have a written system, and they hold no appealing economic value for big tech companies or developers; hence the low interest and limited resources for building representative datasets in native languages.
When we look at the content that is available online, we see that over 75% of those who access the internet do so in only 10 languages. Over 90% of Africans need to switch to a second language to use some of the major platforms and applications we use to create content today. Minority languages are far less well supported than Euro-colonial and other widely spoken languages (such as Mandarin Chinese or Arabic): online services are fewer, fewer interfaces are available in these languages, and there is less user support. In sum, major platforms and applications offer a far better experience to speakers of widely spoken languages than to speakers of minority languages. Clearly, coming online can be a time-intensive and challenging task for the large part of the population whose first language of choice is not Euro-colonial or the dominant language in their region.
These are some of the reasons why it is very challenging for marginalised communities to create affirming content and to bring their knowledges online – this is especially difficult for those people sitting at the intersections of multiple systems of discrimination and oppression (racism, homophobia, ableism and casteism, to name a few). Of all the things the internet could be for these communities, it becomes an inherently dangerous place. The content on the internet, meanwhile, becomes largely a reflection of the ways of being and of understanding the world of a privileged minority – and these are the views, epistemological frameworks and ontologies that are then fed to LLMs via linguistic datasets.
Pushing for multilinguality and multimodality through practice
Learning and reflecting on the state of languages online is a core aspect of the work on knowledge and language justice by Whose Knowledge? We have done this in different ways, including through the State of the Internet’s Languages (STIL) report launched in February 2022, in partnership with the Oxford Internet Institute and the Centre for Internet and Society (India).
The power analysis and the invitations and provocations for the future we offer here come from the rich tapestry that the STIL report is. Like most work that challenges ongoing and historical structures of power and privilege, this is a continuous and community-led effort where the process of bringing it together was as important as the final result.
For example, as a challenge to the rapid pace and adoption of automated translation tools, we brought together a group of translators with anti-colonial values to translate the STIL report. By placing people at the centre of the translation process, we not only built a community but also made it possible for these people to bring their language skills to the report.
We also took the opportunity to test video conferencing and streaming technology with critical lenses: the STIL report was launched in a live online panel, moderated in Portuguese, English and Spanish, with panellists speaking in Zapotec, English, Spanish and Bengali, and with simultaneous interpretation in English, Spanish, Portuguese, Arabic and Bengali. None of the video conferencing platforms we tested performed well in such a comprehensive multilingual scenario: they lack broad interface language support and an accessible and interoperable way to simultaneously broadcast an event in different languages to different channels.
Since only a fraction of the world's languages have writing systems, we strive not to perpetuate the dominance of text over other forms of language. That is why in the STIL report we made the leap to give more importance to audio, images and video (including an international sign translation), also improving the accessibility of content.
Recognising that we all have many different skills and experiences and need to work together to create a truly multilingual internet, the STIL report offers an agenda for action to advance towards a more multilingual and multimodal internet.
Image: The authors' version of how a machine would see the main image in this article.
Looking to the future: Shifting our ways of doing and dreaming
Changing the narrative around the development and possibilities of language models is a challenge that calls for a profound rethinking of the technologies we build and an invitation to dream differently, as some tech communities are already doing.
The tech industry is not entirely responsible for how and why most languages are not represented online. But the capitalist and techno-chauvinist values that drive the sector do perpetuate the mechanisms that marginalise minority languages online. Developers of digital technologies must reflect on how the tech they create and their companies’ policies are contributing to deepening existing systemic injustices, including language discrimination.
Tech companies, as key players, must put ethics and community consent at their core when creating language data, ensuring that language communities have power and safety over what and how they share. This implies that digital language technologies must be articulated around the contexts, needs, designs and imaginations of locally based but globally connected language communities rather than trying to fit linguistic multiplicity into a single technological model.
Language speakers should be a central part of developing technologies and creating content on the platforms and tools they use. Tech companies and standards organisations should prioritise this participative model and see it as a fundamental human right. To achieve this, we need to rethink together the governance model of language infrastructures and advance towards a fairer, community-based and distributed set of governance practices. Building from relatively small, community-governed datasets and through human processes based on mutual respect with marginalised communities is crucial.
These small-scale, community-led processes would also enable and empower (marginalised) language communities to be at the core of the design of these technologies instead of at the margins. Let’s remember that just as the imagination, expertise and ancestral knowledges of Indigenous nations are essential to facing the impending ecological crisis, they can also teach us to design language tech that honours collective and community memories.
Let’s dare to imagine digital language technologies, such as virtual assistants like ChatGPT, that allow all languages and human knowledge, in their multiple forms and in all their vastness, to be represented online. Most importantly, let’s dream of an internet where the Global Majority of the world, who have historically struggled to be seen, heard and recognised, can use the internet to the fullest and with joy. After all, languages are much more than a means of communication; each language is a way of being, doing, knowing and imagining. Ngũgĩ wa Thiong’o, in Decolonizing the Mind, states, “Language, any language, has a dual character: it is both a means of communication and a carrier of culture.” [3] Communicating in a colonial language is a cerebral activity and not an emotionally felt or embodied experience, the author suggests. Let’s aim at digital language technologies that allow us all to tell our stories and share our knowledges with honour and dignity.
Notes:
[1] The Global Majority World or minoritised majority of the world is referred to in the State of the Internet’s Languages report in the following way: “Historical and ongoing structures of power and privilege result in the discrimination and oppression of many different communities and peoples, across the world. These forms of power and privilege are often interlocking and intersecting, so some communities are disadvantaged or oppressed in multiple ways: for instance, by gender, race, sexuality, class, caste, religion, region, ability, and of course, by language. Whether online or in the physical world, these communities make up the majority of the world by population or number, but they are often not in positions of power, and therefore they are treated like a minority. In other words, they are the ‘minoritized majority’ of the world.”
[2] European colonial languages or Euro-colonial languages are defined in the State of the Internet’s Languages report as: “Languages from Western Europe that spread across Africa, Asia, the Americas, the Caribbean and the Pacific Islands through the processes of colonization by Western European companies and governments, from the 16th century onwards. These include English, Spanish, French, Portuguese, Dutch, and German. It’s important to note that these languages were also ‘colonizer’ languages for the Indigenous peoples of North America, not only Latin America (Central and South America).”
[3] Thiong'o, N.W. (2005). Decolonizing the Mind: The Politics of Language in African Literature. East African Educational Publishers Ltd.
Aldo Berríos has a Master's degree in Applied Linguistics from the University of Concepción. He co-founded the Kimeltuwe project that promotes and teaches Mapudungun (Mapuche language) on the internet, and has hundreds of thousands of followers. As a Mapuche language teacher, he has prepared and published reference materials for teaching and learning processes. His academic interests are the variation and diversity expressed especially at the phonological and morphological level of Mapudungun.
Ana Alonso Ortíz is an anthropologist, linguist, speaker, and member of the Dill Yelnbán organisation and the Zapotec community in Oaxaca, Mexico. Ana’s work focuses on the linguistic study of the Zapotec language. She also works on language assessment, generally researching ways to assess language proficiency in Indigenous languages. As an anthropologist, she looks at the relationship between language and culture across Zapotec borders.
Claudia Pozo is a Bolivian brown feminist and human rights technologist with a Master's degree in Development Studies and a BA in Communications. Claudia is a multifaceted activist, social scientist, strategist and tech person who has worked as a web developer and content producer across diverse formats and in multiple languages for over 15 years. She is also the coordinator of the Language Justice programme at Whose Knowledge?