Can State-of-the-Art Machine Learning Tools Give New Life to Household Survey Data?

FEATURE STORY May 30, 2018

STORY HIGHLIGHTS

Poverty measurement relies on collecting detailed household survey data, which is often expensive and time-consuming.

Researchers at the World Bank are experimenting with the latest machine learning tools to expand the use and value of these datasets.

Data scientists need to commit to openness around data, software, and methods to make machine learning an effective tool for tracking poverty.

In 2014 the UN called for putting the best available tools and methods to work in service of achieving the Sustainable Development Goals. Researchers at the World Bank have responded to that call by scouring the globe for the latest machine learning tools to transform our approach to tracking progress in the fight against poverty.

"Collecting household survey data on poverty is expensive and time-consuming, which means that policy makers are often making decisions based on old data," said Asli Demirguc-Kunt, Director of Research at the World Bank. "Machine learning could drastically change the game, making poverty measurement cheaper and much closer to real-time."

Machine learning is a field of computer science that allows computers to examine large bodies of data to identify patterns that data scientists would never find on their own.

Olivier Dupriez, a statistician at the World Bank, unveiled results from his and his colleagues' ongoing work on machine learning and poverty measurement at a recent Policy Research Talk. According to Dupriez, machine learning could make it possible to use a handful of easy-to-collect pieces of household information, such as whether a household has basic appliances or consumes a specific type of food, to predict the poverty status of a household. This would allow for poverty estimation based on simpler household survey instruments, and, hopefully, more effective policies to end poverty.

Data scientists have been developing the foundations of machine learning over the last half century, but its use has only become commonplace with the vast increases in computing power of the last decade. Today, machine learning is employed in everything from facial recognition technology to spam filters. In the case of poverty measurement, machine learning tools are employed to examine large national household survey datasets (which can be augmented with data from other sources, such as geospatial data) to develop models that can most effectively identify poor households.

From the outset, Dupriez and his colleagues faced a challenge: academia and the private sector have already created a multitude of machine learning tools, each with its own advantages and disadvantages. One tool may do a great job of predicting national poverty rates, but perform poorly when identifying specific households that fall below the poverty line.
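
The gap between these two goals can be made concrete with a small sketch. The labels and the two models below are invented purely for illustration (they are not from the Indonesia or Malawi data): one hypothetical model reproduces the overall poverty rate exactly while flagging none of the right households, and the other overstates the rate but finds every poor household.

```python
# Toy illustration with made-up labels: 1 = poor, 0 = not poor.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical model A flags the right number of households, but the wrong ones.
model_a = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical model B overstates the poverty rate, but finds every poor household.
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def poverty_rate(labels):
    """Share of households labeled poor (the aggregate, national-level figure)."""
    return sum(labels) / len(labels)


def recall(actual, predicted):
    """Share of truly poor households that the model flagged as poor."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    return true_positives / sum(actual)


def precision(actual, predicted):
    """Share of flagged households that are truly poor."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    flagged = sum(predicted)
    return true_positives / flagged if flagged else 0.0


for name, predicted in [("A", model_a), ("B", model_b)]:
    print(name, poverty_rate(predicted), recall(actual, predicted), precision(actual, predicted))
```

Model A would look perfect on a national poverty rate metric and useless for targeting specific households, which is why a single tool can rank very differently depending on the metric chosen.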

Given the large number of options, the first stage of research has been exploratory. Dupriez and his colleagues have used large, comprehensive datasets on household poverty from Indonesia and Malawi to test four different approaches, with widely varying results.

The first of these applied 10 out-of-the-box machine learning algorithms to predict which households in Malawi and Indonesia are poor based on a handful of available variables. None of these algorithms proved to be a clear winner; different tools performed better on different metrics of poverty prediction. However, combining the results of these 10 out-of-the-box algorithms significantly improved predictive performance on most metrics.
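
The article does not say how the results were combined, so the following is only a sketch of one common option, majority voting, using three hypothetical models and invented predictions. The data are deliberately constructed so that the models make mistakes on different households, which is the best case for combining; real gains are usually smaller.

```python
# Made-up labels and predictions: 1 = poor, 0 = not poor.
actual = [1, 0, 1, 0, 1, 0, 1, 0]

# Three hypothetical models, each wrong on a different pair of households.
predictions = {
    "model_1": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_2": [1, 0, 0, 1, 1, 0, 1, 0],
    "model_3": [0, 1, 1, 0, 1, 0, 1, 0],
}


def accuracy(actual, predicted):
    """Share of households classified correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def majority_vote(prediction_lists):
    """Label a household poor when more than half of the models say so."""
    combined = []
    for votes in zip(*prediction_lists):
        combined.append(1 if sum(votes) * 2 > len(votes) else 0)
    return combined


ensemble = majority_vote(list(predictions.values()))

for name, predicted in predictions.items():
    print(name, accuracy(actual, predicted))
print("ensemble", accuracy(actual, ensemble))
```

Each individual model is right on six of eight households, while the vote is right on all eight, because no two models err on the same household.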

World Bank researchers also crowdsourced options through a global competition that attracted 6,000 submissions. The five winners from Portugal, Russia, China, and the Philippines faced the challenge of predicting household poverty status using anonymized datasets; they had no idea what country the data were from or whether specific households were classified as poor. Their models slightly outperformed the best of the 10 out-of-the-box algorithms used in the first approach.

Dupriez also employed an automated machine learning process, where the computer itself identifies the optimal way to build a model of poverty prediction. This approach requires a truly high-powered computer: in this case, a computer with 32 processors worked for two days straight. The model produced via automated machine learning proved to be a disappointment, however, as it fell short on most quality metrics. But this may have been the result of the task assigned to it: the computer was directed to optimize only a single metric.
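
Why optimizing one metric can hurt the others is easy to show on a toy case. The article does not name the metric that was used, so this sketch assumes accuracy purely for illustration; the scores, labels, and candidate thresholds are all invented. When poor households are a minority, the cutoff that maximizes accuracy can miss half of them.

```python
# Made-up model scores (higher = more likely poor) and true labels (1 = poor).
scores = [0.90, 0.40, 0.45, 0.42, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical candidate cutoffs for labeling a household as poor.
thresholds = [0.2, 0.4, 0.5, 0.95]


def classify(scores, threshold):
    """Label a household poor when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(actual, predicted):
    """Share of households classified correctly."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)


def recall(actual, predicted):
    """Share of truly poor households that were flagged as poor."""
    true_positives = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    return true_positives / sum(actual)


# Pick the cutoff using accuracy alone, as a single-metric optimizer would.
best_for_accuracy = max(thresholds, key=lambda t: accuracy(actual, classify(scores, t)))

for t in thresholds:
    predicted = classify(scores, t)
    print(t, accuracy(actual, predicted), recall(actual, predicted))
print("chosen by accuracy:", best_for_accuracy)
```

The accuracy-optimal cutoff finds only one of the two poor households, while a lower cutoff finds both at the cost of some false alarms. A search told to care only about the first number has no reason to prefer the second.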

Instead, the most impressive results came from pairing experts and machines. Big tech companies like Google, Netflix, and Amazon have been innovating in the area of machine learning to produce content recommendation systems. Dupriez hired outside experts (Peter Bull and Casey Fitzpatrick), who deployed an algorithm similar to the one Google uses for its Play Store. This algorithm produced excellent results on almost all metrics, making it a strong contender for additional investigation despite its complexity.

While the results so far are promising, the research is still in an early phase, with much more work needed to turn machine learning into a practical tool that can make poverty measurement faster and cheaper. Globally, data scientists need to commit to openness around data, software, and methods to make this a reality.

"The reason that our outside experts were able to build the best performing model is that they know a lot about what model works with what kind of data and what kind of challenge. We need to develop a library of reproducible scripts and open data where researchers can find any kind of machine learning model, no matter who created it, so that we as a community can apply it to the most pressing challenges the world faces."

Olivier Dupriez

Lead Statistician
