MbazaNLP - Building an NLP Community for Sustainable Digital Development
August 3, 2022
Erik Lehmann, Daniel Glatter
Visiting FAIR Forward in Kigali, Rwanda
In May I got the opportunity to leave my desk in Eschborn to visit a project abroad and to learn about implementations on the ground in our partner countries. I joined the project at a time when the implementation was already completed and the remaining challenge was to hand it over and scale it. This is an important recurring task in development projects, and in this case it was accomplished in a very promising way.
But let's start with the context. Rwanda is a small, landlocked country in East Africa with a population of about 12 million people. It is known as the land of a thousand hills and is often referred to as the Switzerland of Africa. The official language is Kinyarwanda, but French and English are also spoken. Its capital Kigali has slightly more than one million inhabitants and is an African tech hub and a site of the GIZ FAIR Forward project.
Kinyarwanda Covid-19 chatbot
The BMZ-financed project “FAIR Forward - Artificial Intelligence for All” strives for a more open, inclusive and sustainable approach to Artificial Intelligence (AI) on a global level. One area of focus is language and text-based AI, also known as Natural Language Processing (NLP). One of the activities of FAIR Forward in Rwanda is the Covid-19 chatbot, developed jointly with the Rwandan Biomedical Center and the local start-up Digital Umuganda. The chatbot answers questions and provides information about COVID-19 in the local language Kinyarwanda as well as in English and French, reaching more than 15,000 people every day, even in remote regions. (4) The team worked on advancing the service towards conversational AI, to allow users to interact with the bot via speech. A big challenge here is the lack of textual resources to build the NLP technology for Kinyarwanda.
Kinyarwanda is a low-resource language, meaning that the number of publicly accessible datasets, NLP models, tools and materials is very limited. However, Kinyarwanda is an important part of Rwandan culture and society. It is spoken by almost all of the native population, and in rural areas in particular it is sometimes the only language used.
To enable verbal interaction with the chatbot, Speech-to-Text and Text-to-Speech technology was needed and had to be developed from scratch for Kinyarwanda. These AI applications require large amounts of training data, so the first project step was the collection of voice and text data in Kinyarwanda. In collaboration with the Rwandan government, the local start-up Digital Umuganda was able to collect 2,000+ hours of Kinyarwanda voice data and published it openly on the Mozilla Common Voice platform. While data collection was a great success, training the AI model proved to be more difficult, and the model did not yet meet the high quality requirements for a fully automated conversational chatbot application.
In data science, the open source idea is very common: public code libraries, which everyone can install from platforms like GitHub, are used and built upon, enabling much faster development. A new, fast-rising open source platform is Hugging Face, which is dedicated to AI models, which are often huge, and their infrastructure. Hugging Face provides a platform for researchers, data scientists, and engineers to store, showcase and share their machine learning models, and it eases collaboration.
Open source is a type of software, data, or knowledge that can be freely used, copied, studied, and modified by anyone. It is created through the collaboration of individuals who work voluntarily on the project and share their work with others in the open source community via platforms like Wikipedia or GitHub.
GitHub is a code hosting platform that developers use to store, manage, and share their code. It's also a social network for developers where they can follow other developers, find collaborators for their projects, and learn from each other.
Publishing the results
During my visit, we worked on the publication of the code and models created for the chatbot. Our goal was to make them freely available to allow a broad audience to use and refine them. To achieve this, we modularized the chatbot, which means we split it into components that can be used independently, including the Kinyarwanda and English parts, the Speech-to-Text module and more, and published the code on GitHub. The AI models were published on Hugging Face and thus help close the gap for Kinyarwanda speech technology. The technology is now publicly accessible and can be used in software or as the baseline for further development. In addition, we built small applications using the Hugging Face Spaces feature, such as the “Kinyarwanda Speech Recognition” demo, which can be used to test and showcase the models. This makes the technology tangible even for a non-technical audience.
These two repositories on GitHub and Hugging Face form the technical foundation of the Mbaza NLP community. The human foundation was built upon the existing partner network from government, academia and start-ups, known through previous events like workshops and an AI fellowship program. Together they defined their scope:
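To make the idea of modularization more concrete, here is a minimal sketch in Python. It is not the published Mbaza code: all class and function names (AnswerModule, Chatbot, speech_to_text) are hypothetical and only illustrate how language-specific answer components and an optional speech front end could be kept independent of each other.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# NOTE: every name in this sketch is hypothetical. It illustrates the
# principle of independent modules, not the actual repository layout.


@dataclass
class AnswerModule:
    """A text-only question-answering component for one language."""
    language: str
    answers: Dict[str, str]
    fallback: str

    def reply(self, question: str) -> str:
        # Normalize the question so that small differences in spelling
        # (case, trailing question mark) still hit the same entry.
        key = question.strip().lower().rstrip("?")
        return self.answers.get(key, self.fallback)


@dataclass
class Chatbot:
    """Combines modules; each module can also be used on its own."""
    modules: Dict[str, AnswerModule] = field(default_factory=dict)
    # Optional speech front end: any callable that maps audio to text.
    speech_to_text: Optional[Callable[[bytes], str]] = None

    def register(self, module: AnswerModule) -> None:
        self.modules[module.language] = module

    def ask_text(self, language: str, question: str) -> str:
        return self.modules[language].reply(question)

    def ask_audio(self, language: str, audio: bytes) -> str:
        if self.speech_to_text is None:
            raise RuntimeError("no speech-to-text module configured")
        return self.ask_text(language, self.speech_to_text(audio))


# The English module works without the rest of the chatbot ...
english = AnswerModule(
    language="en",
    answers={"what is covid-19": "COVID-19 is a disease caused by the SARS-CoV-2 virus."},
    fallback="Sorry, I do not know that yet.",
)

# ... and can be plugged into the full bot when needed.
bot = Chatbot()
bot.register(english)
```

Because the speech front end is just a callable, a trained Speech-to-Text model could be swapped in later without touching the answer modules, which is the practical benefit of splitting the chatbot into components.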
The Mbaza NLP Community is open to anyone interested in AI and NLP and does not just aim to give access to and improve existing models. It also places a big focus on knowledge exchange and joint learning among its members via training, hackathons, and webinars in order to support the wider AI ecosystem in Rwanda. Furthermore, it encourages gaining practical experience and developing new use cases building on existing technologies such as speech recognition and chatbots.
MbazaNLP follows the example of communities such as Masakhane, an association of students, researchers, data scientists and others whose goal is “[…] to strengthen and spur NLP research in African languages, for Africans, by Africans” and thereby to help preserve local languages and culture.
Mbaza NLP Kick-Off
At the official kick-off meeting of the Mbaza Community, Chris Emezue from Masakhane opened the event in front of an audience of 30 people. After the onboarding to the chatbot project and the demonstration of the developed technologies, the next steps for the community were discussed and follow-up meetings were announced. Today, the community has grown to over 100 members on its various channels, such as GitHub, Slack, and WhatsApp, and helps to connect the local community. The topics discussed have already gone beyond the chatbot application.
The community involvement during the project, and especially during the handover, went beyond maintaining the code. It simultaneously improved the open resources and the local capacity in emerging technologies such as NLP. The networking provided a basis for new collaborations and use cases, independent of GIZ involvement.
All the best to Mbaza NLP!
One more thing
We found an open-source speech model for Kinyarwanda with considerably better results on the Hugging Face platform, trained by the SpeechBrain open-source community. They used the public speech data that was collected for the Covid-19 chatbot and published on the Mozilla Common Voice platform.
The publication of resources on well-known open platforms promotes collaboration on common interests even without consultation. This is an excellent example of the mutual benefits of open-source culture.