Assessing Classification Performance for Sampled Remote Sensing Data
Keywords: Sampling, Metadata, Crop Classification
Abstract. Big data poses challenges for storage, management, processing, analysis and visualisation. One technique for handling big data is to work with a representative sample of the data. This paper proposes a sampling algorithm that uses multivariate stratification, with the aim of obtaining a sample that best represents the population while minimising the number of images in the sample. The proposed sampling algorithm performs effectively on a big spatial image dataset of crop types. The results are assessed by the number of images sampled and by how closely the crop percentages in the sample match those of the population. The samples obtained from the proposed algorithm are then used for land cover classification. A random forest, an ensemble method, is trained on the samples and its accuracy is assessed. Precision, recall and F1-scores are computed per crop type, together with the overall accuracy. The random forest classifier performed best on the proposed sample with the fewest images. In addition, the classifier performed better on the proposed sample than on a random sample, owing to the more informative data in the proposed sample. This research develops an effective way of sampling big data for crop classification.
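The per-crop evaluation described above can be illustrated with a minimal sketch. It assumes only that true and predicted crop labels are available as two equal-length lists; the function name `per_class_scores` and the crop labels are hypothetical and are not taken from the paper, and the sketch does not reproduce the paper's sampling algorithm or its random forest configuration.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, overall_accuracy).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    true_count = Counter(y_true)   # images per true crop type
    pred_count = Counter(y_pred)   # images per predicted crop type
    tp = Counter()                 # correctly classified images per crop type
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # Guard against division by zero for classes never predicted or never present.
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)

    # Overall accuracy is the share of all images classified correctly.
    accuracy = sum(tp.values()) / len(y_true)
    return scores, accuracy


# Made-up labels, for illustration only.
y_true = ["maize", "maize", "wheat", "wheat", "soy", "soy"]
y_pred = ["maize", "wheat", "wheat", "wheat", "soy", "maize"]
scores, accuracy = per_class_scores(y_true, y_pred)
```

Reporting the scores per crop type alongside the overall accuracy matters when crop classes are imbalanced, since a high overall accuracy can hide poor recall on a rare crop.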