The Quest For AI Training Data: A Roadmap For Future Development

Written by our partner Dalberg Data Insights as part of the AI Training Data for Agriculture experiment.

Together with our team of regional experts, the data analysts assessed the applicability of AI within the context of international cooperation. Their goal was to answer the question: to what extent is GIZ data applicable to the task of training AI models (and how can we use this information to create a roadmap for more accurate, informed policymaking)?

Missing Millions: The Persistent Problem of Inefficient Datasets

Setting targets for the global Sustainable Development Goals (SDGs) becomes difficult when we have no clue where to start. Recent research estimates that the number of people currently living in extreme poverty – that is, on less than $1.25 per day – may be underestimated by as much as 20% (around 350 million people), making poverty eradication (the first of 17 SDGs) a much more challenging objective to achieve. Access to accurate data is not only important for monitoring progress towards these SDGs but also crucial to “evidence-based” policymaking on a wide variety of issues, from social inclusion and food security to transportation planning.

Yet, most of the information that currently informs policymaking is inferred and extrapolated from the results of traditional collection methods (such as household surveys). These results are problematic, as the financial and time costs of such methods inevitably lead to imperfect coverage and low representation. The small size of datasets collected allows reliable estimates to be made only at a coarse granularity. When managing development issues where information on the socio-economic fabric of a region’s population and its access to key resources (such as food) is crucial for operational decision making, these challenges become particularly severe.

Shifting Our Approach: How Data Innovation Can Fill the Gaps

On the heels of these findings, several actors in the development space have called for the collection of additional data, culminating in the UN’s 2030 Agenda for Sustainable Development, which classified the issue as a top priority. Many in both the private sector and academia have heeded the call (see, for instance, the work of the Global Partnership for Sustainable Development Data), outlining the potential of recent developments in artificial intelligence to fill the gaps in data-scarce environments.

The societal shift towards a “data economy” results in the generation of increasingly large amounts of information, both intentionally (thanks to advances in technologies such as satellite imagery) and as byproducts of everyday activities (sometimes referred to as “Big Data”). While unstructured and not necessarily collected for the purpose of delivering specific insights, this data still has enormous potential that can be unlocked with the analytical power of various computational models broadly known as “machine learning” (a subfield of artificial intelligence). For example, satellite imagery has been used to predict poverty in Africa; mobile phone data has been used to study human mobility after disasters; search engine data has shown promise for epidemic detection; and social media data has been used to estimate unemployment.

Teaching the Machine: The Promise and Challenge of AI

Machine learning models generally seek to either classify new observations into categories (a classification problem) or predict unknown values for specific features of those observations (a regression problem). To do this, a model must be provided with a set of examples – known as “training data” – to learn from. For instance, if we want to train a model to predict the wage of a football player based on their height, speed, and passing accuracy, then we would train the model on a collection of real player profiles, each labeled with their correct wage. The catch? In the absence of correct data, the model simply cannot provide accurate predictions.
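
The football example can be made concrete with a small sketch. It uses k-nearest-neighbour regression, chosen here only because it is short enough to read in full; it is not the method used in the experiment. The player profiles, wages, and the function name predict_wage are all invented for illustration.

```python
from math import dist

# Hypothetical, made-up training data (not real players):
# (height in cm, top speed in km/h, passing accuracy in %) -> weekly wage
TRAINING_DATA = [
    ((180.0, 32.0, 85.0), 50_000.0),
    ((175.0, 30.0, 78.0), 30_000.0),
    ((190.0, 28.0, 70.0), 20_000.0),
    ((185.0, 34.0, 90.0), 80_000.0),
]


def predict_wage(features, training_data, k=1):
    """Predict a wage as the average wage of the k most similar labeled players.

    Similarity is plain Euclidean distance on the raw features; a real model
    would normally rescale the features first.
    """
    if not training_data:
        # Without labeled examples there is nothing to learn from.
        raise ValueError("cannot predict without labeled training data")
    nearest = sorted(training_data, key=lambda example: dist(example[0], features))[:k]
    return sum(wage for _, wage in nearest) / len(nearest)


# Regression: predict an unknown value for a new, unlabeled observation.
new_player = (184.0, 33.5, 89.0)
print(predict_wage(new_player, TRAINING_DATA))
```

The prediction is only as good as the labeled examples: with an empty training set the function refuses to answer, and with wrong wages in the training set it would confidently return wrong predictions.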

Globally, there are new data sources with the potential to provide more frequent, high-resolution, and large-scale insights on social issues, but we still require accurately labeled training data to calibrate the models. This implies that, in most cases, data innovation will still require the collection of traditional data (household surveys and field campaigns, for example). That keeps the approach costly and inefficient, and it remains one of the main obstacles to unleashing the full potential of artificial intelligence for tackling issues of socio-economic development.

Uncovering the Hidden Data Treasury

Fortunately, there may be another route: unknowingly, many development agencies sit on a wealth of data that is waiting to be mined for just such applications. It includes household surveys and field data (such as crop yield estimates) collected for a specific use (like project monitoring and evaluation), and then often never looked at again. These institutions have already gone through the painstaking process of ground data collection several times, acquiring a treasure of datasets which can be reused to train several models for varying purposes. Aware of this opportunity, the new GIZ Data Lab – in partnership with Dalberg Data Insights – recently delivered new value from data that previously lay idle in the GIZ records.

A pilot experiment demonstrated how field data collected from farmers five years ago for monitoring and evaluation purposes can be combined with satellite imagery to produce an exhaustive crop-type map of Burkina Faso. This project validated the importance of setting up a strategy for a “data treasury” within development agencies – a secure place to store collected data in an orderly fashion, even after it has fulfilled its primary use, so it remains accessible for countless future uses. This increases the value of internal data for the better prioritization of future interventions, assessment of program impact, and search for answers to research questions. Furthermore, it allows institutions to position themselves on the international scene as providers of AI training data for positive global progress.

A Strategic Roadmap: Creating Accessible AI Training Data For Development

A data treasury consists partly of historical data but also of future data yet to be collected. As a result, we must shift toward a more comprehensive, forward-looking approach, potentially collecting more data than is required to answer the immediate question at hand. In this way, we can maximize the benefits of data collection efforts, making the resulting data suitable for training AI models capable of extracting relevant and new insights from Big Data.

To fully unleash the potential of the data treasury, a data strategy should be put in place to ensure that:

Historical data is stored and made easily accessible to verified and trusted data scientists.
Data is ready to use – that is, provided in an appropriate format, cleaned and documented.
Data is easy to discover, both internally and by trusted third parties, through effective search tools that rely on quality metadata.

This requires the establishment of adequate technical infrastructure, as well as data governance protocols to regulate data access, data licensing, privacy issues, etc.
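
As a rough illustration of what metadata-driven discovery with access rules could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the DatasetRecord fields, the access labels, the search function, and the two catalog entries are invented and do not describe any actual GIZ system or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata describing one stored dataset."""
    title: str
    country: str
    year: int
    keywords: frozenset
    license: str
    access: str  # assumed labels: "internal" or "trusted_third_party"


# Invented catalog entries, for illustration only.
CATALOG = [
    DatasetRecord("Household survey (illustrative)", "Country A", 2015,
                  frozenset({"household", "income"}),
                  "internal-use-only", "internal"),
    DatasetRecord("Crop yield field data (illustrative)", "Country B", 2016,
                  frozenset({"crop", "yield"}),
                  "open-with-attribution", "trusted_third_party"),
]


def search(catalog, keyword, requester_is_internal):
    """Return records tagged with the keyword that the requester may see.

    Internal staff see every match; external requesters only see records
    whose access label allows trusted third parties.
    """
    keyword = keyword.lower()
    return [
        record for record in catalog
        if keyword in record.keywords
        and (requester_is_internal or record.access == "trusted_third_party")
    ]


for record in search(CATALOG, "crop", requester_is_internal=False):
    print(record.title, record.country, record.year, record.license)
```

The point of the sketch is that discovery and governance rest on the same metadata: a dataset without keywords cannot be found, and one without an access label or license cannot be shared safely.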

We can only take full advantage of the data revolution by bringing together all pieces of the puzzle, and training data is a central part. The broad development sector and its key stakeholders must work together to create a concrete roadmap for extracting the most value from existing data collection efforts, allowing this data to be effectively combined with AI and non-traditional data sources for the delivery of missing insights essential to evidence-based policymaking. Such an approach goes beyond the digitization of aid agencies, foundations, or international organizations. By putting a structured data strategy and data governance in place, development agencies and data innovators can accelerate the uptake of the data revolution, moving forward confidently in the achievement of the Sustainable Development Goals.