Aerial Scene Parsing: From Tile-level Scene
Classification to Pixel-wise Semantic Labeling

Yang Long1, Gui-Song Xia2,1,*, Liangpei Zhang1, Gong Cheng3, Deren Li1

1. State Key Lab. LIESMARS, Wuhan University, Wuhan 430079, China
2. School of Computer Science, Wuhan University, Wuhan 430079, China
3. School of Automation, Northwestern Polytechnical University, Xi'an 710072, China


Road map




- Introduction -

Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content, e.g., by assigning a semantic label to every pixel of the image. With the popularization of data-driven methods, the past decades have witnessed promising progress on ASP, approached either through tile-level scene classification or through segmentation-based analysis of high-resolution aerial images. However, the former scheme often produces results with tile-wise boundaries, while the latter involves a complex modeling process from pixels to semantics, which typically requires large-scale image samples with pixel-wise semantic annotations. In this paper, we address these issues in aerial scene parsing from the perspective of moving from tile-level scene classification to pixel-wise semantic labeling. Specifically, we first revisit aerial image interpretation with a literature review. We then present Million-AID, a large-scale scene classification dataset containing one million aerial images. With the presented dataset, we also report benchmarking experiments using classical convolutional neural networks (CNNs). Finally, we perform ASP by unifying tile-level scene classification and object-based image analysis to achieve pixel-wise semantic labeling. Extensive experiments show that Million-AID is a challenging yet useful dataset that can serve as a benchmark for evaluating newly developed algorithms. When transferring knowledge from Million-AID, CNN models pretrained on Million-AID and then fine-tuned consistently outperform those pretrained on ImageNet for aerial scene classification, demonstrating the strong generalization ability of the proposed dataset.
Moreover, our hierarchical multi-task learning method achieves state-of-the-art pixel-wise classification on the challenging GID, a promising step toward bridging tile-level scene classification and pixel-wise semantic labeling for aerial image interpretation. We hope that our work can serve as a baseline for aerial scene classification and inspire a rethinking of the scene parsing of high-resolution aerial images.

- Revisiting Aerial Image Interpretation -

With the progress of sensor technology, the spatial resolution of aerial images has continuously improved, which has greatly promoted the development of aerial image interpretation. Consequently, the interpretation of aerial images has evolved over a long course from pixel-wise image classification, through segmentation-based image analysis, to tile-level image understanding, depending on the visual characteristics of aerial images at different resolutions.

- Scene Classification: A New Benchmark on Million-AID -

Data-driven algorithms represented by deep learning have shown overwhelming advantages over conventional classification methods based on handcrafted features, and have thus dominated aerial image recognition over the past decade. In this section, we train a number of representative CNN models and conduct comprehensive evaluations for multi-class and multi-label scene classification on Million-AID, which we hope will provide a benchmark for future research.
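For the multi-label setting, each tile carries a set of scene labels rather than a single class, so evaluation is typically example-based. The sketch below computes a mean per-example F1 score over predicted and ground-truth label sets; the label names are hypothetical, and this is an illustrative metric, not necessarily the exact evaluation protocol used in the benchmark.

```python
# Example-based multi-label F1: average the per-tile F1 between the
# predicted label set and the ground-truth label set.
def example_based_f1(y_true, y_pred):
    scores = []
    for truth, pred in zip(y_true, y_pred):
        truth, pred = set(truth), set(pred)
        if not truth and not pred:          # both empty: perfect match
            scores.append(1.0)
            continue
        tp = len(truth & pred)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(truth) if truth else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

# Hypothetical aerial scene labels for two tiles.
truth = [{"residential", "road"}, {"farmland"}]
pred  = [{"residential"},         {"farmland", "forest"}]
print(round(example_based_f1(truth, pred), 3))  # prints 0.667
```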

- Million-AID

- Multi-class classification

- Multi-label classification

- Transferring Knowledge From Million-AID -

Million-AID consists of large-scale aerial images characterizing diverse scenes, endowing it with rich semantic knowledge of scene content. Hence, it is natural to explore the potential of transferring this semantic knowledge to other domains. To this end, we consider two basic strategies, i.e., fine-tuning pre-trained networks for tile-level scene classification and hierarchical multi-task learning for pixel-level semantic parsing.

- Fine-tuning pre-trained networks for scene classification
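The core idea of this strategy can be illustrated with a toy, dependency-free sketch: keep a pretrained "backbone" frozen and train only a new classification head on its features. The backbone, data, and hyper-parameters below are illustrative stand-ins, not the actual CNNs or the Million-AID pipeline used in the paper.

```python
import math, random

random.seed(0)

def backbone(x):
    # Stand-in for a frozen pretrained feature extractor.
    return [x[0] + x[1], x[0] - x[1]]

def make_point():
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    return x, 1 if x[0] + x[1] > 0 else 0   # label is linearly separable

data = [make_point() for _ in range(50)]

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):                        # SGD on the head parameters only
    for x, y in data:
        f = backbone(x)
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        p = 1.0 / (1.0 + math.exp(-z))      # sigmoid
        g = p - y                           # gradient of logistic loss w.r.t. z
        w = [wi - lr * g * fi for wi, fi in zip(w, f)]
        b -= lr * g

def predict(x):
    z = sum(wi * fi for wi, fi in zip(w, backbone(x))) + b
    return 1 if z > 0 else 0

acc = sum(predict(x) == y for x, y in data) / len(data)
print(f"head-only training accuracy: {acc:.2f}")
```

In practice the same pattern is applied with a deep CNN: load weights pretrained on Million-AID (or ImageNet), replace the final classification layer, and continue training, optionally unfreezing earlier layers.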

Classification Accuracy (%) on AID Dataset Using Different Training Schemes

Classification Accuracy (%) on NWPU-RESISC45 Dataset Using Different Training Schemes

- Hierarchical multi-task learning for semantic parsing

The conventional CNN learns scene features via stacked convolutional layers, and the output of the last fully connected layer is usually employed for scene representation. However, learning stable features from a single layer can be difficult because of the complexity of scene content. Moreover, data sparsity, a long-standing problem, can easily lead to overfitting and weak generalization because of the insufficient knowledge captured from limited training data. To alleviate these issues, we introduce a hierarchical multi-task learning method and further explore how much of the knowledge contained in Million-AID can be transferred to boost pixel-level semantic parsing of aerial images. To this end, the GID, which consists of a training set of tile-level scenes and large-size test images with pixel-wise annotations, provides an opportunity to bridge tile-level scene classification toward pixel-level semantic parsing. Overall, the presented framework consists of four components, i.e., hierarchical scene representation, multi-task scene classification (MSC), hierarchical semantic fusion (HSF), and pixel-level semantics integration, as shown below.
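The fusion step can be illustrated with a simplified sketch: each fine-grained class score is re-weighted by the probability of its parent coarse category, so the two classification tasks constrain each other. This is an illustrative toy, not the paper's exact HSF formulation, and the class names and hierarchy are hypothetical.

```python
# Hypothetical two-level label hierarchy: fine class -> parent coarse class.
PARENT = {"pond": "water", "river": "water",
          "meadow": "vegetation", "forest": "vegetation"}

def hierarchical_fusion(coarse_probs, fine_probs):
    # Weight each fine-class score by its parent coarse-class probability,
    # then renormalize so the fused scores form a distribution.
    fused = {f: p * coarse_probs[PARENT[f]] for f, p in fine_probs.items()}
    total = sum(fused.values())
    return {f: v / total for f, v in fused.items()}

coarse = {"water": 0.8, "vegetation": 0.2}
fine   = {"pond": 0.25, "river": 0.3, "meadow": 0.4, "forest": 0.05}
fused = hierarchical_fusion(coarse, fine)
print(max(fused, key=fused.get))   # prints "river"
```

Note how the strong coarse-level evidence for "water" flips the decision from "meadow" (the top fine-level score alone) to "river" after fusion.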

- Qualitative comparisons among different classification schemes

Images in the first to fifth columns indicate the original image, ground truth annotations, classification maps of baseline, MSC, and the full implementation of our method, respectively.

- Performance comparison among different methods

- Visualization of classification results

Visualization of the land cover classification results on the fine classification set of GID. Images in the first to fourth columns indicate the original image, ground truth annotations, classification maps of PT-GID, and classification maps of our method, respectively.


For the construction of Million-AID, please refer to the second item of the following citations.


A public evaluation platform for the multi-class and multi-label scene classification based on Million-AID.


title={Aerial Scene Parsing: From Tile-level Scene Classification to Pixel-wise Semantic Labeling}, 
author={Yang Long and Gui-Song Xia and Liangpei Zhang and Gong Cheng and Deren Li},

title={On Creating Benchmark Dataset for Aerial Image Interpretation: Reviews, Guidances and Million-AID},
author={Yang Long and Gui-Song Xia and Shengyang Li and Wen Yang and Michael Ying Yang and Xiao Xiang Zhu and Liangpei Zhang and Deren Li},
journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},


If you have any problem, please contact:

  • Yang Long at
  • Gui-Song Xia at