Computational Linguists Help Africa Try to Close the AI Language Gap

Introduction

The fact that African languages are underrepresented in the digital AI ecosystem has gained international attention. On July 29, 2025, Nature published a news article stating that

More than 2,000 languages spoken in Africa are being neglected in the artificial intelligence (AI) era. For example, ChatGPT recognizes only 10–20% of sentences written in Hausa, a language spoken by 94 million people in Nigeria. These languages are under-represented in large language models (LLMs) because of a lack of training data.” (source: AI models are neglecting African languages — scientists want to change that)

Another example is BBC News, released on September 4, 2025, stating that

Although Africa is home to a huge proportion of the world’s languages – well over a quarter according to some estimates – many are missing when it comes to the development of artificial intelligence (AI). This is both an issue of a lack of investment and readily available data. Most AI tools, such as ChatGPT, used today are trained on English as well as other European and Chinese languages. These have vast quantities of online text to draw from. But as many African languages are mostly spoken rather than written down, there is a lack of text to train AI on to make it useful for speakers of those languages. For millions across the continent this means being left out.” (source: Lost in translation – How Africa is trying to close the AI language gap)

To address this problem, linguists and computer scientists are collaborating to create AI-ready datasets in 18 African languages via The African Next Voices project. Funded by the Bill and Melinda Gates Foundation ($2.2-million grant), the project involves recording 9,000 hours of speech across 18 African languages in Kenya, Nigeria, and South Africa. The goal is to create a comprehensive dataset that can be utilized for developing AI tools, such as translation and transcription services, which are particularly beneficial for local communities and their specific needs. The project emphasizes the importance of capturing everyday language use to ensure that AI technologies reflect the realities of African societies. The 18 African languages selected represent only a fraction of the over 2,000 languages spoken across the continent, but project contributors aim to include more languages in the future.

Role of Computational Linguists in the Project

Computational linguists play a critical role in the African Next Voices project. Their key contributions include:

  • Data Curation and Annotation: They guide the transcription and translation of over 9,000 hours of recorded speech in languages like Kikuyu, Dholuo, Hausa, Yoruba, and isiZulu, ensuring linguistic accuracy and cultural relevance. This involves working with native speakers to capture authentic, everyday language use in contexts like farming, healthcare, and education.
  • Dataset Design: They help design structured datasets that are AI-ready, aligning the collected speech data with formats suitable for training large language models (LLMs) for tasks like speech recognition and translation. This includes ensuring data quality through review and validation processes.
  • Bias Mitigation: By leveraging their expertise in linguistic diversity, computational linguists work to prevent biases in AI models by curating datasets that reflect the true linguistic and cultural nuances of African languages, which are often oral and underrepresented in digital text.
  • Collaboration with Technical Teams: They work alongside computer scientists and AI experts to integrate linguistic knowledge into model training and evaluation, ensuring the datasets support accurate translation, transcription, and conversational AI applications.

Their involvement is essential to making African languages accessible in AI technologies, fostering digital inclusion, and preserving cultural heritage.

Final Thoughts

From the perspective of a U.S. high school student interested in pursuing computational linguistics in college, inspired by African Next Voices, here are some final thoughts and conclusions:

  • Impactful Career Path: Computational linguistics offers a unique opportunity to blend language, culture, and technology. For a student like me, the African Next Voices project highlights how this field can drive social good by preserving underrepresented languages and enabling AI to serve diverse communities, which could be deeply motivating.
  • Global Relevance: The project underscores the global demand for linguistic diversity in AI. As a future computational linguist, I can contribute to bridging digital divides, making technology accessible to millions in Africa and beyond, which is both a technical and humanitarian pursuit.
  • Skill Development: The work involves collaboration with native speakers, data annotation, and AI model training/evaluation, suggesting I’ll need strong skills in linguistics, programming (e.g., Python), and cross-cultural communication. Strengthening linguistics knowledge and enhancing coding skills could give me a head start.
  • Challenges and Opportunities: The vast linguistic diversity (over 2,000 African languages) presents challenges like handling oral traditions or limited digital resources. This complexity is exciting, as it offers a chance to innovate in dataset creation and bias mitigation, areas where I could contribute and grow.
  • Inspiration for Study: The focus on real-world applications (such as healthcare, education, and farming) aligns with my interest in studying computational linguistics in college and working on inclusive AI that serves people.

In short, as a high school student, I can see computational linguistics as a field where I can build tools that help people communicate and learn. I hope this post encourages you to look into the project and consider how you might contribute to similar initiatives in the future!

— Andrew

4,361 hits

Leave a comment

Blog at WordPress.com.

Up ↑