A COMPARISON OF PRE-PROCESSING APPROACHES FOR REMOTELY SENSED TIME SERIES CLASSIFICATION BASED ON FUNCTIONAL ANALYSIS

: Satellite remote sensing has gained a key role for vegetation mapping distribution. Given the availability of multi-temporal satellite data, seasonal variations in vegetation dynamics can be used trough time series analysis for vegetation distribution mapping. These types of data have a very high variability within them and are subjected by artifacts. Therefore, a pre-processing phase must be performed to properly detect outliers, for data smoothing process and to correctly interpolate the data. In this work, we compare four pre-processing approaches for functional analysis on 4-years of remotely sensed images, resulting in four time series datasets. The methodologies presented are the results of the combination of two outlier detection methods, namely tsclean and boxplot functions in R and two discrete data smoothing approaches (Generalized Additive Model ”GAM” on daily and aggregated data). The approaches proposed are: tsclean -GAM on aggregated data (M01), boxplot -GAM on aggregated data (M02), tsclean - GAM on daily data (M03), boxplot -GAM on daily data (M04). Our results prove that the approach which involves tsclean function and GAM applied to daily data (M03) is ameliorative to the logic of the procedure and leads to better model performance in terms of Overall Accuracy (OA) which is always among the highest when compared with the others obtained from the other three different approaches.


INTRODUCTION
In the last four decades, satellite remote sensing has gained a key role for vegetation and habitat distribution mapping (Zlinszky et al., 2015).The habitat distribution can be properly represented by vegetation map, which, when repeated over time, are useful to assess their preservation (Dash andOgutu, 2016, Viciani et al., 2016).Given the availability of multi-temporal satellite data, seasonal and inter-annual variations in vegetation dynamics (phenology) can be quantified trough time series analysis and used for vegetation distribution mapping (Caparros-Santiago et al., 2021).These types of data have a very high variability within them, due to the fact that they are derived from satellite images acquired at different time periods, using different sensors, capturing constantly changing dynamics, with ever-changing weather conditions and with solar radiance inclination which creates shadows and reflections caught by the sensors (Meraner et al., 2020).Furthermore, if we add to this the technological issues which may occur when dealing with artificial instrumentation (e.g., of sensor malfunctions), we can immediately realize that the data must be fixed before being used for different purposes.The most challenging spectral reflectance abnormal values are often caused by adverse weather conditions, undetected sub-pixel cloud cover, atmospheric dust and gaseous absorbers but also seasons lighting variations, soil-induced disturbances, shading, or sensor glitches (Alvera-Azcárate et al., 2012).All of them are generally referred to as artifacts (Du et al., 2003) and must be removed from the timeseries.These artifacts can alter the temporal pattern of reflectance values, causing a reduction in the accuracy of the pheno- * m.balestra@pm.univpm.itlogical estimations (Clark et al., 2002).Therefore, in order to process the data correctly, it is necessary to implement a data pre-processing phase.Given the very large variability of the data and, consequently, the outliers occurrence, it is essential to operate through diversified methods for the detection of those anomalies to obtain a representative output of the vegetation to classify (Jackson and Chen, 2004).A knowledge about what is intended to classify is fundamental because outliers detection also comes through a deep awareness concerning the expected value of a given class.Without such background, which only the user's experience can provide, information might be lost that, instead, is essential to the classifier.Hence, smoothing of the discrete data becomes a key phase to achieve accurate classification models which describe detailed vegetation dynamics and reduce as good as possible the outlier (Zeng et al., 2020).Considering these issues, it is necessary to use appropriate approaches (i.e data smoothing and outliers detection methods) that will mitigate the noise effects of time-series from satellite imagery (Santos et al., 2021).The greater the care and attention in this step, the higher the quality of the data that will be processed and, hence, the output that will be obtained.In statistics, smoothing consists in the application of a function filter aimed to highlight relevant patterns by mitigating noises generated by environmental, computational, or physiological artifacts (Atkinson et al., 2012).The presence of time gaps, given by the missing data due to both the satellite temporal resolution and the artifacts, makes it necessary to deal also with a data interpolation phase that will allow to obtain a continuous function throughout the year.The interpolation process consists to average the data in a series with contiguous values to describe a pattern, but considering these values in a cyclic way given by their annual nature.Interpolation process can either precede or even be followed by a data aggregation phase.The data, in order to provide valuable insights for the resulting classification, can be processed through functional data analysis (FDA) (Hurley et al., 2014).The latter must represent discrete data as functions, as the FDA needs to analyze the data as a single function rather than a set of point values spread over time (Pesaresi et al., 2022).Therefore, to work by functional analysis, a proper outlier detection step, data smoothing phase and data interpolation can not be ignored.
With this papers, we want to figure out a suitable strategy to adopt when processing time-series, through the comparison among the classification accuracies obtained.Thus, we compared the outputs of four data pre-processing approaches.The approaches presented are the results of the combination of two outlier detection options (tsclean function in forecast package (Hyndman and Khandakar, 2008) and boxplot function in graphics package (Murrell, 2005)) and two discrete data smoothing approaches (Generalized Additive Model "GAM" (Wood, 2006) on daily and aggregated data).The approaches proposed are: tsclean-GAM on aggregated data (M01), boxplot-GAM on aggregated data (M02), tsclean-GAM on daily data (M03), boxplot-GAM on daily data (M04).
The main contribution of this work is to perform four different satellite image time series pre-processing approaches, combining two outliers detection methodology (tsclean and boxplot) and a data smoothing technique (GAM) applied to differently aggregated datasets, testing them within 2 study areas with a combination of predictors and vegetational indicies.

Frasassi Gorge
The study area overlaps with the Special Area of Conservation "Gola di Frasassi IT5320003", covering an area of 728 ha within the Rossa and Frasassi Gorge Regional Natural Park between the municipalities of Genga and Fabriano, in the province of Ancona.The Frasassi gorge is placed inside the anticline of Mount Valmontagnana -Mount Frasassi, in the pre-Apennine mountain belt.The peak elevation of the site is 931.2 m.a.s.l. at "Monte di Valmontagnana", while the minimum elevation measured is 200 m.a.s.l. at the edge of the Esino river's left bank.

Conero Mount
This study area is part of the territory in between the Special Areas of Conservation "Monte Conero IT5320007" and "Portonovo e falesia calcarea a mare IT5320007".It covers an area of 650 hectares within the Conero Regional Natural Park in the province of Ancona, between the municipalities of Sirolo and Ancona.Conero mount is a limestone promontory of 582 m.a.s.l., being the only stretch of limestone coastline from Trieste to Gargano, it interrupts the continuous low and sandy shoreline typical of the Adriatic coast.
The images have been acquired by the Sentinel-2A and Sentinel-2B satellite platforms, both managed by the European Space Agency (ESA) as part of the European Copernicus plan (Pesaresi et al., 2022).For each study area, 93 L2A images referable to the period between April 2017 and March 2020 have been collected using the sen2r package.The satellite imagery acquisition frequency permits to aggregate in accordance with defined temporal intervals.They can be aggregated by year, month (semester, trimester, bimester), week (weeks, biweeks) or days of the year (DoY).Sentinel-2 acquisition frequency allows for a weekly aggregation time, allowing for time-series composed of 52 values.The images have been coregistered, cropped with the shapefiles matching the limits of the study areas and masked by the cloud cover.Seven vegetation indices have been computed for each of these images and each one has been used as prediction variable.Additionally we used the FPCA to group the information related to the curves' temporal variability into a set of main components.The coef- ficients quantifying the weight of each component and maintaining the chronological order of the functional variations are referred to as "scores" (Pesaresi et al., 2020).We used them, in total and in a fraction thereof, as second and third prediction variables in this classification.Moreover, we removed the last component resulting from the FPCA and we reconstructed the time-series and we used them as the fourth prediction variable.The seven vegetation indices are: Normalized Difference Vegetation Index (NDVI), Modified Chlorophyll Absorption in Reflectance Index (MCARI), Green Normalized Difference Vegetation Index (GNDVI), Normalized Difference Red/Green Redness Index (RI), Normalized Difference Red-Edge (NDRE), Normalized DIfference Moisture Index (NDMI), Modified Normalized Difference Water Index (MNDWI).The proposed classification algorithm is Random Forest.In order to ease the readability of the manuscript, the general workflow is reported in Figure 3. combining the dataset to be subjected to one of the 4 pre-processing approaches.

Outliers Removal
Temporal information is obtained by extracting the capture date and then converting it into the corresponding day of the year (DoY).Using this information, data are chronologically sorted and the dataset is thus obtained, which will be subjected to the outlier detection process.The detection methods for the anomalous point values are the most cited in the literature (Willsky et al., 1980, Hu et al., 2021, Venkatasubramanian et al., 2003).Among them, the most widely used are the so-called model-based techniques (Mehrang et al., 2015, Basu andMeckesheimer, 2007), followed by the densitybased (Tang and He, 2017, Tian et al., 2016, Angiulli and Fassetti, 2007) and the histogramming ones (Blázquez-García et al., 2021, Muthukrishnan et al., 2004).In this paper two different methods of point outlier detection are compared: the tsclean function of the forecast package (Hyndman et al., 2020) and the boxplot function of the graphics package (Murrell and Murrell, 2020).Both are part of model-based techniques, meaning that a point x at time t can be declared an outlier if the distance from its expected value x is greater than a predefined threshold τ (Formula 1).
The tsclean function is used to process univariate time series and the detection of outliers is different for seasonal and nonseasonal time series (Kandanaarachchi et al., 2020).In this study the interest is in the former type, where a significant seasonal component is identified in the variation of the phenomenon.Specifically, the function uses a time series decomposition method: Seasonal and Trend decomposition using LOESS (STL) (Cleveland et al., 1990).The STL method employs location-adapted regression models to deconstruct a time series into trend, seasonal, and residual components.The STL algorithm smooths the time series using the locally estimated scatterplot smoothing (LOESS) method in two cycles: an inner and an outer cycle.During the inner one, the seasonal and trend components are calculated.The residual is then found by subtracting these from the time series (Cleveland, 1990).For each time series, outliers are identified and replaced by interpolation.
In the context of outlier detection methods, the boxplot belongs to the techniques using basic statistics.Pixels considered outliers are those having values placed over X times the interquartile range from the first and third quartiles, and represented as isolated points in the plot.In this study, the coefficient X is equal to 1.5, considered as the default value in the boxplot graphics package function of R.Although groups of pixels can be processed simultaneously, the function is configured by considering the single pixel as a univariate time series (Bernard et al., 2012), making the outputs comparable to those obtained with the tsclean function.As opposed to the latter, the boxplot function permits to analyze the data by selecting a chosen time span.Therefore, it is necessary to set a vector containing the time span that best considers, in an independent way, the seasonal variability of the data (a value detected anomalous on winter is not necessarily anomalous on summer).In this study, monthly time span has been chosen, based on the density of observations in the DoYs and their seasonal distribution.

Data Smoothing
The proposed smoothing algorithm is a Generalized Additive Model (GAM) based on Cyclic Cubic Spline Regression.GAMs are extensions of linear models in which the predictor is the sum of regular functions plus a conventional parametric component (Wood and Wood, 2015).
GAMs allow the configuration of both complex non-linear relationships and inferential statistics by understanding and explaining the inherent structure of the discrete data dispersion model (Azzalini and Scarpa, 2012).The Cyclic Cubic Spline Regression summarises the multi year variability of vegetation surfaces from satellite imagery into one artificial/ideal year.The Cyclic Cubic Spline Regression is a function composed of several connected polynomials designed to interpolate a set of points into defined intervals, called knots.The number of knots in which the dataset is divided is set through a process of cross validation.Separating the dataset into subsets, localized smoothing is achieved and overfitting (given by the global influence of each point on the fit) is avoided (Faraway, 1992).The Cyclic Cubic Spline Regression provides the connection between the first and last knot to consider the temporal nature of the analyzed curves which describe a continuous trend between December and January.
Interpolation process is performed by the gam function, of the mgcv package (Wood and Wood, 2015).The "GAM on Aggr" approach involves the GAM application on data previously aggregated by weeks, while the "GAM on DoY" one applies the GAM directly on cleaned daily data, not yet aggregated.In the first case, an early data grouping is carried out, in order to compensate the double collection of images in the same days but in different years, and subsequently aggregated by weeks.In the second case, instead, the compensation of double collected values is carried out directly by GAM, to obtain time-series fitted with 365 values.The latter are then aggregated by weeks, making the two approaches comparable.

Predictors
Once that the outliers removal and the data smoothing processes have been performed with the different approaches, each pixel value (recorded throughout the 4 years) has been fit in a yearly time series.The latter presents slight differences according to the pre-processing approach used, as can be noticed in the figure 4.
The aggregated time-series resulting from approaches M01, M02, M03, and M04 are the first predictors used in this study for classification.These are subsequently subjected to FPCA, a statistical method for variance analysis of functional data (Ullah and Finch, 2013).It is well adapted to time-series, where individual observations are not independent, but rather constrained by their chronological order.It provides an estimation of dataset complexity by determining the minimum number of components needed to represent the content of the dataset without loss of information.The "scores" are the parameters quantifying the similarities among the time-series: they provide information on the position, shape and variation of each curve observed in the space (Shang, 2014).The main FPCA outputs, besides the scores, are the eigenvalues and the eigenfunctions.The eigenvalues represent the variation explained by each component for each time-series.Their sum is equal to the overall data variance.The components explaining up to 99% of the total variance are considered to extract a reduced number of scores.Eigenfunctions, on the other hand, estimate the scores' value by describing the largest functional variances for each component (Hurley et al., 2014).Through these outputs, the original data (X iK ) can be rebuilt as a summation of the product of each score (A ik ) by its eigenfunction (ϕ k ), plus the mean value (µ) (Wang et al., 2015) (Formula 2): Through this process the four predictors used in this study were obtained: time-series (TS), scores (ST), reduced scores (SR) and rebuilt time-series (TSR).The classification is performed by Random Forest.The overall accuracy (OA) is defined by the proportion of correctly classified pixels by the algorithm and the total number of pixels used to train the model.The number of decision trees set in the algorithm is 1500.The Random Forest training is performed through repeated cross validation.The latter has been set by the authors and the number of folders [kfold] in which the dataset is divided is 10, while the number of repetitions is 5.All functions used in this section belong to the caret (Classification and Regression Training) package, which is specific to the construction of classification models (Kuhn, 2015).

RESULTS
In this chapter, the tested combinations for defining our best pre-processing approach for remote sensed time-series are shown through boxplot graphs.These various methods, obtained from the 4 pre-processing approaches, using 4 predictors from 7 different vegetation indices in 2 different study areas resulted in a total of 224 models.The charts summarize the variability in accuracies generated by the models for the 4 approaches with respect to the different variables.Therefore, for each plot, each method is compared individually to either a predictor, an index, or a study area.All of the accuracies obtained from the models of each method for each predictor/index/study area being analyzed are then grouped together.In particular, each approach/predictor comparison chart groups 56 (224/4) accuracies.Each approach/index comparison chart groups 32 (224/7) accuracies.Each approach/area comparison chart contains 112 (224/2) accuracies.

Approaches and Predictors
The results obtained from the predictors' classification reveal a substantial difference between the accuracies achieved.Models obtained through TS, ST, and SR predictors achieved higher accuracies than those based on TSR (Figure 5).For each method and index, in both study areas, the TSR thus demonstrate an insufficient models' accuracy and it cannot be compared to the other predictor models (Table 1).Therefore, the results obtained from the TSR are not included in the subsequent charts showing the accuracies achieved by the different approaches (M01, M02, M03, M04) when compared to the vegetation indices and the study areas.The combination tsclean and GAM on DoY (M03) proves to perform better than the other approaches for TS, ST and SR.

Approaches and Vegetation Indices
Analyzing the 32 results obtained for each vegetation index (Table 2), it can be noted that M03 is the best performing method for 5 of the 7 indices used (MCARI, NDVI, NDMI, NDRE, RI) (Figure 6).The M04 method is the best performing for GNDVI and MNDWI.

Approaches and Study Areas
As a result of the 112 results obtained for each study area, the chart (Figure 7) reveals that the M03 method continues to be the most accurate (Table 3).Each of the 224 combinations of approaches has generated a map of the analyzed area, along with  its corresponding confusion matrix and OA.By selecting the models that achieved the highest OA for the two study areas, in Figure 8 we can observe the classification obtained in Frasassi gorge using the M03 approach, with SCR and the vegetation index RI.In Figure 9 we can observe the classification of Conero mount obtained using the M03 approach, SCR and the vegetation index NDRE.The maps were produced from the best results obtained through the combination of different approaches.However, a few classes were misclassified for labels that were not initially represented in our classes.

DISCUSSION
According to the results obtained in this work, the method resulting in higher model performance is M03, which involves the tsclean function and the application of GAM on the daily data.
The tsclean function allows proper dataset cleaning from outliers with limited computation time and proves to be an important tool when processing time-series from vegetation indices, as earlier proved by (Pesaresi et al., 2020, Pesaresi et al., 2022).Nevertheless, the boxplot function has the merit of allowing for outliers removal only (Kerandel et al., 2020) and, moreover, although computation time is slightly stretched, in terms of   coding it is easily handled and outliers can be detected within the time series over different periods of the year (i.e.month, bimester, trimester).Within more heterogeneous environments such as the Frasassi gorge, the boxplot function performs better for certain indices when compared to the tsclean function.In literature, different techniques have been used to carry out the smoothing and correction phase of time-series satellite data, such as the curve fitting (Pickers and Manning, 2015), the Fourier decomposition (Mingwei et al., 2008), the asymmetric Gaussian function (Jonsson and Eklundh, 2002), the double logistic functions (Atkinson et al., 2012, Eklundh andJönsson, 2015), the Whittaker smoother (Shao et al., 2016, Kandasamy et al., 2013), the Savitzky-Golay filter (Huang et al., 2021), the high order spline with roughness damping (Hermance et al., 2007), the spatio-temporal tensor completion method (Chu et al., 2021) and other spatio-temporal combination methods such as the adaptive spatio-temporal weighted method (Li et al., 2017) and hybrid Generalised Additive Model (GAM)geostatistical space-time model (Poggio et al., 2012) wich are even useful to fill temporal gaps.GAM utilization for regression model fitting widely demonstrated in literature (Hua et al., 2021).Applying GAM on the daily data permits to perform data aggregation during the subsequent steps of the work.
Being able to manipulate the dataset starting with the DoY is an advantage in the procedure's logic and actually an efficient way to proceed.Computational times are stretched since the range of values within which the data can be interpolated (from 1 to 365) is greater.However, a single subsequent aggregation step is necessary to compensate positively the times, thus allowing for easy variation in aggregation time (i.e.weekly or biweekly).As a result, the process proves to be both elastic and adaptable as required.Within the two study areas, differences in the produced model accuracy values are evident and substantial, making them coherent with those reported by (Pesaresi et al., 2020, Pesaresi et al., 2022).These results are related to the different complexity levels of vegetation phytocoenoses and land cover in general.In the Conero mount study area there are 4 classes identified, while in the Frasassi gorge study area 8 classes are defined.Those are situated in a much more complex geomorphological and topographical context.This emphasizes the need to test the described methodologies in different and diverse contexts, so as to further assess their reliability.From this standpoint, this work has succeeded in providing a suitable comparative analysis among 4 approaches for time-series preprocessing.The processes involved can be replicated in other areas in order to enhance and validate the mapping accuracy.

CONCLUSION
This method is indeed proven to be fast and efficient.In this paper, 4 time series preprocessing approaches were compared, combining 2 outlier detection methods (tsclean function forecast package and boxplot function graphics package) and 2 interpolation algorithm application methods (GAM on aggregate data and GAM on daily data).Therefore, this research intended to stress the preprocessing part of the data that will be subjected to FPCA to identify which of the proposed methodologies performed best in terms of outputs and computational time.From the results obtained, the approach which involves tsclean function and GAM applied to daily data (M03) is ameliorative to the logic of the procedure and leads to better model performance in terms of Overall Accuracy.Although the algorithms implemented with the GAM have demonstrated the ability to adequately interpolate aggregate and daily data, the application of other techniques is also desirable to improve the construction of the time series.Other solutions, in the outlier detection phase, will be subject to further analysis since there are several methodologies which can be applied to clean the time series.As a result of the results obtained and the identification of this methodological approach for mapping, it will therefore be possible to periodically repeat these tests to produce maps up-to-date and, thus, to comply with EU regulations.

Figure 3 .
Figure 3. Workflow for the overall accuracy evaluation, combining the dataset to be subjected to one of the 4 pre-processing approaches.

Figure 4 .
Figure 4. NDVI values for a randomly chosen pixel which fit differently in the function in the different approaches, the black dots are the single pixel values recorded in the 4 years satellite imagery used in this paper.

Figure 8 .
Figure 8. Best model classification obtained in Frasassi gorge, using M03 approach with SCR and RI as vegetation index.

Figure 9 .
Figure 9. Best model classification obtained in Conero mount, using M03 approach with SCR and NDRE as vegetation index.

Table 1 .
Median accuracy values from models comparing approaches and predictors.

Table 2 .
Median accuracy values obtained from models comparing Approaches and vegetation indices.

Table 3 .
Median accuracy values obtained from models comparing approaches and study areas.