Every Language Matters: Building a More Inclusive Digital Future

October 5, 2024
Photo: UNDP Nigeria

Authors:  

Alena Klatte, Data Lead, UNDP Chief Digital Office  
Alexander Hradecky, AI Policy Expert, UNDP Chief Digital Office  
Barbora Bromová, AI & Data intern, UNDP Chief Digital Office  
Dr Nii Longdon Sowah, Senior Lecturer – Department of Computer Engineering, University of Ghana 

With the rapid rise of artificial intelligence (AI) as a major driver of technological progress, democratizing access to digital opportunities through greater representation and inclusion of diverse languages has never been more important.  

 

Linguistic diversity remains a defining feature of our global cultural landscape. An estimated 7,000 languages are spoken worldwide, each a signifier of heritage and identity. Today, linguists estimate that most of the world’s unique languages face severe endangerment or extinction by the end of this century, threatening a loss of entire knowledge systems and jeopardizing the preservation of much of humanity’s rich cultural heritage. The digital realm is facing similar linguistic imbalances, with most online activity taking place in high-resource languages, such as English.  

Low-resource languages—such as Twi (mostly spoken in Ghana), Sinhala (mostly spoken in Sri Lanka) and Quechua (an Andean-Equatorial language)—appear on less than 0.1 percent of websites. This is creating a digital divide that limits equitable access to digital innovation. Many communities lack fluency in the second or third languages needed to access essential online services such as e-learning or telemedicine. 

 

AI models underperform for speakers of low-resource languages 

Large language models (LLMs) rely heavily on data scraped from the Internet, where the world’s highest-resource languages are overrepresented. As a result, AI models underperform for the 1.2 billion native speakers of low-resource languages. Even when AI systems can process queries in low-resource languages, their outputs tend to be slower, less culturally relevant, and insufficiently covered by model safeguards. Voice capabilities are also limited—presenting a significant accessibility barrier, particularly for communities with strong oral traditions or low literacy.  

The consequences of this disparity are far-reaching and disproportionately affect communities in the Global South. Closing the language divide in AI is critical to catalysing locally relevant innovation, empowering marginalized communities, and ensuring no one is left behind, no matter the language they speak.

 

Addressing the divides with diverse projects on the ground 

Despite these challenges, promising initiatives are emerging across public, private, non-profit, and international actors to address the lack of low-resource language representation in AI.  

On the African continent, organizations such as Lelapa AI, Masakhane, GhanaNLP, and Professor Vukosi Marivate’s Data Science for Social Impact Research Group are making significant progress in incentivizing community efforts to build datasets for African low-resource languages and integrate them into AI systems.

Global organizations like Lacuna Fund, GIZ FairForward, International Development Research Centre (IDRC), Mozilla Common Voice, Google Research, and OpenAI are supporting these efforts and contributing resources to making low-resource language data and African Language AI products more widely available. 

Mozilla Common Voice

Launched by the Mozilla Foundation in 2017, Mozilla Common Voice is a crowdsourcing initiative for collecting language text and audio to train and test speech recognition software. Mozilla operates two platforms, Scripted and Spontaneous, and currently serves 130+ linguistic communities.

The Scripted platform invites speakers to submit sentences, record speech samples, or review clips submitted by other users.  

Spontaneous Speech reduces reliance on public domain text by giving speakers prompts to respond to; other contributors then transcribe the responses. This allows for more diverse datasets, including code-switching and sociolects. Datasets are released every three months under a free CC0 license. Common Voice has more than 2 million downloads and is used by academic researchers, companies, civil society and governments.

 

While these existing initiatives are welcome, they remain under-resourced, regionally fragmented and often lack the benefits that come with broader cross-sectoral collaboration. This has presented challenges for the interoperability and optimization of datasets for machine learning and AI modelling.  

Progress has also been slowed by uncertainties around data governance, as communities differ in their priorities regarding consent, privacy, and representation. Current projects have yet to achieve the scale necessary to close the AI performance gap for low-resource languages. Addressing this challenge requires a multi-stakeholder effort to catalyse collective action and ensure no community is left behind.

Advancing local language innovation in Ghana

Among Ghana's many languages, Twi is one of the most widely used. Spoken by approximately 18 million people in Ghana and neighbouring countries, Twi is part of the Akan language cluster within the Niger–Congo family. It serves as an important lingua franca in the region and has three primary dialects: Asante, Akwapim, and Fante. Asante served as the basis for a standardized literary language for Twi in Ghana in the early to mid-20th century.

Ghana is home to multiple academic, commercial, and not-for-profit projects relevant to low-resource language digitalization and natural language processing.  

Many of the most promising advances—including GhanaNLP’s app named Khaya, which supports automatic speech recognition and translation in Twi, Ga, Dagbani, Yoruba and other local languages—have emerged as grassroots, open-source initiatives driven by volunteers.  

In parallel, commercial actors have focused on voice data and non-standard speech recognition, with Google’s first AI research centre on the African continent piloting its Project Relate in Accra for users with atypical speech patterns.

However, despite these developments, significant challenges remain. Dr. Nii Longdon Sowah, Senior Lecturer at the University of Ghana, Department of Computer Engineering, highlights a critical issue: "Most language data troves are created as part of foreign-funded projects, and rarely made available to the local community. For new and locally relevant AI innovation to thrive, a transparent and long-term commitment to digitalizing local languages is needed." 

 

UNDP launches a ‘learning by doing’ Local Language Partnerships Accelerator Pilot

UNDP is facilitating locally owned, globally supported partnerships for large-scale digitization of low-resource languages. In collaboration with local communities, academia, civil society, and the private sector, the project aims to digitize languages and strengthen the capacity of local stakeholders to innovate.

International partners, including UNDP, will provide resources, support policy development, and coordinate existing efforts to enable large-scale digitalization.  

To this end, UNDP has launched a Local Language Partnerships Accelerator Pilot as part of the AI Hub for Sustainable Development. This initiative is co-designed with the Italian G7 Presidency and seeks to accelerate AI development, particularly in African countries. This four-month learning space will assess effective and ethical partnership models that can catalyse the development and adoption of African Language AI for sustainable local innovations. The pilot invites cross-sector partners to collaborate and to contribute to the design of the AI Hub for Sustainable Development.

 

We will be asking partners to consider the following critical topics:  

  • Innovative approaches to digitalization  
  • Community ownership models 
  • Responsible and scalable partnership structures 
  • Optimal methods for translating inclusive language technologies into public and economic value 

 


 

 
