Detecting human settlements from satellite images using computer vision

In our previous blog posts we explored different approaches and opportunities to track two indicators of Sustainable Development Goal (SDG) 11, the “ratio of land consumption rate to population growth rate” and the “average share of the built-up area of cities that is open space for public use”, with alternative data. This data includes maps of built-up areas as well as population density maps, published by institutions such as the European Commission's Joint Research Centre (JRC) and WorldPop. The maps are free to use and cover the whole globe, and are therefore widely used by policymakers, researchers, and development institutions. Understanding how these maps are produced helps users customize them to their needs and develop new geospatial applications.

Up-to-date built-up maps are a prerequisite for both indicators

For both indicators, the city's footprint, the so-called built-up area, is in the denominator, and to calculate change we require images of the same location at two different points in time. The JRC published a series of built-up maps under the Global Human Settlement Layer (GHSL), based on Landsat satellite imagery for the years 1975, 1990, 2000, and 2015. They are well suited for historical comparison, but for calculating the SDG 11 indicators the six-year-old data was not recent enough. In 2018 the JRC added new Sentinel-2 images from Copernicus to their database, which provide a higher resolution of 10 meters per pixel (compared to 30 meters per pixel for Landsat). However, because of the different resolutions, the Sentinel-2 and Landsat products cannot be compared consistently, and a second release of the classification based on Sentinel-2 is not available yet. Therefore, the only option is to build our own predictive model and use it to classify consecutive and consistent images.
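To make the role of the two built-up maps concrete, here is a minimal sketch of the first indicator. It assumes the logarithmic formulation of the land consumption rate and the population growth rate that appears in UN-Habitat's indicator metadata; this post does not spell out the formula, and the function names and numbers below are made up for illustration.

```python
import math

def land_consumption_rate(built_up_start_km2, built_up_end_km2, years):
    """Annual rate of change of the built-up area (log formulation, assumed)."""
    return math.log(built_up_end_km2 / built_up_start_km2) / years

def population_growth_rate(pop_start, pop_end, years):
    """Annual rate of change of the population (log formulation, assumed)."""
    return math.log(pop_end / pop_start) / years

def lcr_to_pgr_ratio(built_up_start_km2, built_up_end_km2, pop_start, pop_end, years):
    """Ratio of land consumption rate to population growth rate."""
    lcr = land_consumption_rate(built_up_start_km2, built_up_end_km2, years)
    pgr = population_growth_rate(pop_start, pop_end, years)
    return lcr / pgr

# Made-up example: built-up area grows by 10%, population by 5%, over 5 years.
ratio = lcr_to_pgr_ratio(100.0, 110.0, 2_000_000, 2_100_000, years=5)
print(f"LCR / PGR = {ratio:.2f}")
```

A ratio above 1 means the built-up area expands faster than the population grows. Both area values have to come from maps produced in a consistent way, which is why comparable classifications at two points in time matter.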

Quality training data is usually the biggest obstacle

To build this kind of model, we need annotated training data for the algorithm to learn from. We considered two sources. The first option was to use Sentinel-2 satellite images from 2018 as input and the JRC predictions as labels; both are free to use. These predictions have limitations: because they are themselves model output, a new model trained on them automatically inherits their errors. To improve on this, we experimented with Planet Labs imagery, which provided us with even higher resolution. Besides the Basemaps, which come with a resolution of 3 m per pixel, we also received SkySat images from Planet with a resolution of up to 50 cm per pixel. We combined them with property maps of Medellin from our partner DANE, Colombia's National Administrative Department of Statistics. Since property outlines are not equivalent to built-up areas, this training data was not perfect either, but the high resolution makes it much easier to recognize building footprints.

Medellin property outlines on 1) Sentinel-2 10m 2) Planet Basemap 3m 3) Planet SkySat 50cm resolution

We realized that clouds pose a particular difficulty when working with our satellite imagery. Medellin is close to the equator, and the weather is very often cloudy, which limits the ability to observe the surface of the earth. The more frequently pictures are taken, the higher the probability of getting a cloud-free image; the Sentinel-2 mission revisits Colombia every seven days. Even more effective than picking the least cloudy image is to combine several images into a composite. Planet does this for its Basemaps; the high-resolution SkySat image, however, was only taken once. Another characteristic of satellite imagery is that satellites record not only the bands visible to humans (red, green, blue) but, in the case of Sentinel-2, up to 9 other spectral bands. One of them covers near-infrared light, which can be used to detect vegetation, a non-built-up area. This information is useful for the classification of settlements, so we added it to our analysis.
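The two ideas in this paragraph, compositing several scenes and using near-infrared light to spot vegetation, can be sketched with NumPy. This is an illustration under assumptions, not the processing used for this post: it assumes a per-pixel median as the compositing rule, an already available cloud mask, and the NDVI as one common vegetation index (the post only states that the near-infrared band was added). The function names and the tiny arrays are hypothetical.

```python
import numpy as np

def median_composite(stack, cloud_mask):
    """Combine several scenes of the same place into one image.

    stack:      float array of shape (time, height, width, bands)
    cloud_mask: bool array of shape (time, height, width), True = cloudy
    Cloudy pixels are ignored; the per-pixel median of the rest is returned.
    """
    masked = np.where(cloud_mask[..., None], np.nan, stack)
    return np.nanmedian(masked, axis=0)

def ndvi(red, nir, eps=1e-6):
    """Normalized difference vegetation index; high values suggest vegetation."""
    return (nir - red) / (nir + red + eps)

# Three made-up scenes of a 2x2 area with two bands: red (0) and near-infrared (1).
stack = np.zeros((3, 2, 2, 2))
stack[..., 0] = 0.1   # vegetation reflects little red light
stack[..., 1] = 0.5   # and a lot of near-infrared light
cloud_mask = np.zeros((3, 2, 2), dtype=bool)
stack[1, 0, 0, :] = 0.9        # a bright cloud in the second scene
cloud_mask[1, 0, 0] = True     # flagged by the (assumed) cloud mask

composite = median_composite(stack, cloud_mask)
vegetation_index = ndvi(composite[..., 0], composite[..., 1])
print(vegetation_index)
```

A pixel that is cloudy in every scene stays empty (NaN) in the composite and has to be masked out later.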

Up to this point, the data is a small collection of very large images. The high-resolution SkySat images in particular are too large to process in one piece. To be able to feed the data into a model, we divided the imagery and the corresponding labels into smaller patches of 256 by 256 pixels. The area covered by a patch differs significantly between the two data sets because of their different resolutions: while 256 pixels of the Sentinel-2 image cover whole streetscapes, the same number of pixels in the SkySat image shows only a few houses.
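A minimal sketch of this cutting step, assuming the image and its label are already loaded as aligned NumPy arrays and that incomplete patches at the border are simply dropped (the post does not say how borders were handled). The function name and the array sizes are hypothetical.

```python
import numpy as np

def cut_into_patches(image, label, size=256):
    """Cut an image (H, W, bands) and its label (H, W) into aligned square patches.

    Incomplete patches at the right and bottom border are dropped.
    Assumes the image is at least `size` pixels in both directions.
    """
    height, width = label.shape
    patches, masks = [], []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            patches.append(image[top:top + size, left:left + size])
            masks.append(label[top:top + size, left:left + size])
    return np.stack(patches), np.stack(masks)

# Made-up scene: 600 x 520 pixels, 4 bands (red, green, blue, near-infrared).
image = np.zeros((600, 520, 4), dtype=np.float32)
label = np.zeros((600, 520), dtype=np.uint8)
patches, masks = cut_into_patches(image, label)
print(patches.shape, masks.shape)

# Ground distance covered by one patch edge at the resolutions named above.
sentinel_extent_m = 256 * 10.0   # 10 m per pixel
skysat_extent_m = 256 * 0.5      # 50 cm per pixel
print(sentinel_extent_m, skysat_extent_m)
```

One patch edge spans about 2.5 km on Sentinel-2 but only 128 m on SkySat, which is why the two kinds of patches look so different.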

1) Sentinel-2 image with a masked (black) region due to clouds 2018 / JRC GHSL 2018
2) Planet SkySat image 2021 / Medellin property map 2021

In the last step before feeding the data into our model, we split it into two subsets. The first subset, the training dataset, contains 80% of the data and is used to fit the model. The second subset, the testing dataset, contains the remaining 20% and is not used for training. Instead, its images are given to the model for prediction, and the predictions are compared with the labels to measure performance.
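Such an 80/20 split can be sketched as follows. The random shuffling and the fixed seed are assumptions made for the illustration; the post does not describe how patches were assigned to the two subsets, and the function name is hypothetical.

```python
import numpy as np

def split_patches(patches, masks, test_fraction=0.2, seed=42):
    """Shuffle patches and split them into a training and a testing subset."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(patches))
    n_test = int(round(len(patches) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (patches[train_idx], masks[train_idx]), (patches[test_idx], masks[test_idx])

# Ten made-up one-pixel "patches", numbered so the split is easy to inspect.
example_patches = np.arange(10).reshape(10, 1, 1)
example_masks = example_patches.copy()
(train_x, train_y), (test_x, test_y) = split_patches(example_patches, example_masks)
print(len(train_x), len(test_x))
```

Because neighboring patches of the same city look alike, a random split can make test scores look better than they would on an unseen region, so predicting on a city that was not part of the training data is a useful additional check.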

A convolutional neural network able to translate a satellite image to an image of urban footprint

Deep learning has improved the state of the art in various scientific domains, outperforming rule-based methods in, e.g., natural language processing. In remote sensing, deep learning is equally disruptive. The availability of high-resolution satellite imagery in combination with computing power and neural network architectures sets new benchmarks. The emergence of non-linear machine learning models such as random forests has enabled new fields of application, such as the classification of built-up areas. However, these algorithms look only at individual pixels, although neighboring pixels are not independent of each other: the more neighbors are built-up, the more likely the pixel itself is built-up. Convolutional neural networks (CNNs) therefore also take the surrounding pixels into account for their prediction. These computer vision algorithms are used, e.g., to distinguish images of cats and dogs or to recognize road signs for autonomous driving. In the context of our satellite images, we not only want to know if there is a dog in our picture, but also where it is: we want to know its outline. This is a typical semantic segmentation problem, and the U-Net, which was initially developed for biomedical image segmentation, is one of the most popular CNN architectures for it.
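The point about neighboring pixels can be illustrated with a hand-written 3x3 filter that reports, for every pixel, the share of its eight neighbors that are built-up. This is only a toy example with a fixed filter and a hypothetical function name; a CNN learns many such filters from the data instead of having them defined by hand.

```python
import numpy as np

def built_up_neighbor_share(mask):
    """Share of the 8 surrounding pixels that are built-up, for every pixel.

    mask: 2D array with 1 = built-up, 0 = not built-up.
    Pixels outside the image count as not built-up.
    """
    height, width = mask.shape
    padded = np.pad(mask.astype(float), 1)
    total = np.zeros((height, width))
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue  # skip the pixel itself
            total += padded[dy:dy + height, dx:dx + width]
    return total / 8.0

# A made-up 3x3 block that is built-up everywhere except in the center.
example_mask = np.ones((3, 3))
example_mask[1, 1] = 0
neighbor_share = built_up_neighbor_share(example_mask)
print(neighbor_share)
```

The center pixel is labeled as not built-up, yet all of its neighbors are, which is exactly the kind of context a pixel-by-pixel classifier cannot see.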

The 256x256 pixel patches are fed into the U-Net, which learns to predict the built-up areas. The output is not a single value but a 256x256 pixel built-up prediction in which every pixel holds its assigned probability of being built-up, like an image-to-image translation. This prediction is compared to the original label, and the error (1 - overlap) is calculated. This is repeated over several iterations, and with every step the network learns to make fewer mistakes.
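One common way to express an error of the form 1 - overlap is the soft Dice loss, sketched below with NumPy. That the overlap is measured with the Dice coefficient is an assumption; the post does not name the exact measure, and the function name and example values are made up.

```python
import numpy as np

def dice_loss(predicted_prob, label, eps=1e-6):
    """1 - overlap, with the overlap measured by the soft Dice coefficient.

    predicted_prob: per-pixel probabilities of being built-up, between 0 and 1
    label:          per-pixel ground truth, 1 = built-up, 0 = not built-up
    """
    intersection = np.sum(predicted_prob * label)
    overlap = (2.0 * intersection + eps) / (np.sum(predicted_prob) + np.sum(label) + eps)
    return 1.0 - overlap

example_label = np.array([[1.0, 1.0], [0.0, 0.0]])
perfect_loss = dice_loss(example_label, example_label)            # prediction equals label
unsure_loss = dice_loss(np.full((2, 2), 0.5), example_label)      # 50% everywhere
wrong_loss = dice_loss(1.0 - example_label, example_label)        # everything inverted
print(perfect_loss, unsure_loss, wrong_loss)
```

The loss is close to 0 for a perfect prediction and close to 1 when prediction and label do not overlap at all. In a deep learning framework the same expression would be written with differentiable tensor operations so that the network can be trained on it.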

1) Sentinel-2 Image 2018 2) JRC GHSL 2018 3) U-Net Prediction 2018

Especially with the Sentinel-2 data, the predictions got relatively close to the JRC layer, which showed that it is feasible to use the predictions of one model as labels in order to reproduce it. After training, the model parameters were saved so the classification can be reused. In the next step, we fed recent Sentinel-2 images of the city of Cali into the model to create a recent built-up map of it. Since this region was not used for training, the results show how well the classification algorithm copes with new data. Both maps show a satellite image from 2021: the left one is overlaid with the JRC prediction from 2018, the right one with the prediction of our model on images from 2021. We observed new built-up areas around Villa Fatima, which are now covered by our prediction. In some cases, the prediction also includes parts of the street, an error that also occurs in the original label but is now amplified. The prediction is not binary but assigns a probability to every pixel; the dark red areas are more certain than the lighter and yellow ones. By increasing the threshold, areas of lower certainty can be excluded from the built-up layer.
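The effect of the threshold can be sketched in a few lines. The threshold values, the function names, and the probabilities below are made up for illustration; the area calculation assumes square 10 m Sentinel-2 pixels.

```python
import numpy as np

def built_up_mask(probabilities, threshold=0.5):
    """Turn per-pixel probabilities into a binary built-up layer."""
    return probabilities >= threshold

def built_up_area_km2(mask, pixel_size_m=10.0):
    """Area of all built-up pixels, assuming square pixels of the given size."""
    return mask.sum() * pixel_size_m ** 2 / 1e6

# A made-up 2x2 prediction.
probabilities = np.array([[0.95, 0.60],
                          [0.40, 0.10]])
default_mask = built_up_mask(probabilities, threshold=0.5)
strict_mask = built_up_mask(probabilities, threshold=0.8)
print(default_mask.sum(), strict_mask.sum())
print(built_up_area_km2(default_mask))
```

Raising the threshold from 0.5 to 0.8 removes the less certain pixel from the layer and therefore also reduces the built-up area that goes into the indicators.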

Google Maps Satellite Image 2021 and 1) JRC GHSL 2018 2) U-Net Prediction 2021

The second model, trained on high-resolution data, led to less satisfactory results. The labels (property map) deviate too much from the real building footprints. Nevertheless, the model learns the relationship relatively well, and the prediction (left) looks closer to real building footprints than the original label (right). This illustrates the ability of this class of algorithms to learn from large amounts of even imperfect data. However, roads and cars are almost always incorrectly marked as built-up areas. This error in the classification of both the SkySat and the Sentinel-2 images is mostly related to the training data. The results can be improved if more effort is put into collecting clean ground truth data.

To finally calculate the SDG indicators, we use the classification model based on Sentinel-2. The 10 m resolution is sufficient to get usable results from the indicators and to understand the dynamics of the city. In addition, its results are more accurate than those of the high-resolution model, the processing is less memory intensive, and the images and labels are globally available, which makes the method scalable. The code for the classification including access to and processing of the data can be found here.