Modelling soil carbon fractions with visible near-infrared (VNIR) and ⁎ N.M. Knox
by user
Comments
Transcript
Modelling soil carbon fractions with visible near-infrared (VNIR) and ⁎ N.M. Knox
Geoderma 239–240 (2015) 229–239 Contents lists available at ScienceDirect Geoderma journal homepage: www.elsevier.com/locate/geoderma Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy N.M. Knox a,b,c, S. Grunwald c,⁎, M.L. McDowell d,1, G.L. Bruland d,2, D.B. Myers c,3, W.G. Harris c a Earth Observation Division, South African National Space Agency (SANSA), PO Box 484, Silverton 0127, South Africa University of KwaZulu-Natal, School of Agricultural, Earth and Environmental Sciences, Private Bag X01, Scottsville 3209, Pietermaritzburg, South Africa Soil and Water Science Department, University of Florida, 2181 McCarty Hall, PO Box 110290, Gainesville, FL 32611, USA d Natural Resources and Environmental Management Department, University of Hawai'i Mānoa, 1910 East–West Rd, Sherman 101, Honolulu, HI 96822, USA b c a r t i c l e i n f o Article history: Received 4 March 2014 Received in revised form 15 October 2014 Accepted 26 October 2014 Available online 9 November 2014 Keywords: Partial least squares regression Random forest regression Total carbon Soil organic carbon Hydrolysable carbon Recalcitrant carbon Chemometric modelling a b s t r a c t Accurate assessment of soil carbon fractions would provide valuable contributions towards monitoring in ecological observatories, assessment of disturbance impacts, global climate and land use change. The majority of chemometric modelling studies have focused on measuring only total soil carbon (C), with only a few evaluating individual soil C pools. Analysis of pools allows for a more detailed picture of ecosystem processes, specifically decomposition and accretion of C in soils. This study evaluated the potential of the visible near infrared (VNIR), mid infrared (MIR) and a combined VNIR–MIR spectral region to estimate and predict soil C fractions. Partial least squares regression (PLSR) and random forest (RF) ensemble tree regression models were used to estimate four different soil C fractions. The soil C fractions analysed included total — (TC), organic — (SOC), recalcitrant — (RC) and hydrolysable carbon (HC). The sample set contained 1014 soil samples collected across the state of Florida, USA. Laboratory analysis revealed the wide range of total and organic C values, from 1 to 523 g·kg−1, with only about 10% of the samples containing inorganic C which was therefore omitted from the study. Both PLSR and RF modelling were shown to be effective in modelling all soil C fractions, with as much as 94–96% of the variation in the TC, SOC and RC pools, and 86% of HC being explained by the models. Although both PLSR and RF models were successful in modelling C fractions, RF models appear to target the physical properties linked to the property being analysed, and may therefore be the better modelling method to use when generalising to new areas. This study demonstrates that diffuse reflectance spectroscopy is an effective method for non-destructive analysis of soil C fractions, and through the use of RF modelling a spectral range between 2000 and 6000 nm should suffice to model these soil C fractions. © 2014 Elsevier B.V. All rights reserved. 1. Introduction Soil carbon (C) sequestration has been targeted by the Intergovernmental Panel on Climate Change (IPCC) as one of the main means to mitigate rise in atmospheric C levels (Bellon-Maurel and McBratney, 2011; Metz et al., 2007). Total soil C can be separated into different soil C fractions based on their turnover rates. Whether the soil is a source or a sink of C depends on the balance between the C recalcitrance Abbreviations: DRS, diffuse reflectance spectroscopy; HC, hydrolysable carbon; MIR, mid infrared; PLSR, partial least squares regression; RC, recalcitrant carbon; RF, random forest regression; RMSE, root mean square error; RPD, residual prediction deviation; SOC, soil organic carbon; SWIR, short wave infrared; TC, total carbon; VNIR, visible near infrared. ⁎ Corresponding author. E-mail address: sabgru@ufl.edu (S. Grunwald). 1 Present Address: Scitor Corporation, 385 Moffett Park Dr, Sunnyvale, CA 94089, USA. 2 Present Address: Principia College, Biology and Natural Resources Department, Elsah, IL 62028, USA. 3 Present Address: Division of Plant Sciences, University of Missouri, 214 Waters Hall, Columbia, MO 65211, USA. http://dx.doi.org/10.1016/j.geoderma.2014.10.019 0016-7061/© 2014 Elsevier B.V. All rights reserved. (chemical stability of the soils) and evolution of C cycling processes (i.e. the impact of decomposition, accretion, etc.). There has been much debate about the impact of climate warming on soil C (Bellamy et al., 2005; Craine and Gelderman, 2011; Davidson and Janssens, 2006; Fierer et al., 2005; Thornley and Cannell, 2001; Trumbore et al., 1996) and the uncertainty involved in monitoring soil C across large regions is still substantial (Grunwald et al., 2011). There is in addition a need to assess the impact of land use shifts and disturbances (e.g., wildfire) on soils; however a rapid, robust and effective means to monitor changes in soil C at regional, continental and global scales is lacking. As a response to these needs research on spectral-based methods has exploded over the last decade and introduced new avenues including soil C models informed by proximal sensors, such as visible near infrared (VNIR) and mid infrared (MIR) diffuse reflectance spectroscopy (DRS). Importantly, these methods have been shown to yield comparable results to traditional time-consuming and more costly laboratory methods (BellonMaurel and McBratney, 2011). Diffuse reflectance spectroscopy in the VNIR spectral range has been used widely to characterize soil organic carbon (SOC) and soil total 230 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 carbon (TC) (Brown, 2007; Ge et al., 2011; Sarkhot et al., 2011; Shepherd and Walsh, 2002; Udelhoven et al., 2003; Vasques et al., 2008, 2009, 2010). Soil OC models derived from VNIR spectra differed widely with studies covering 8.1–19.8 g kg− 1 (Viscarra Rossel et al., 2006), 6.1–33.0 g kg−1 (Reeves et al., 2002), 2.3–55.8 g kg− 1 (Shepherd and Walsh, 2002), 1.3–285.8 g kg− 1 (Chang et al., 2001), and 0.1–536.8 g kg−1 (Brown et al., 2006) soil C suggesting a more universal applicability of models the wider their C content range. Midinfrared spectroscopy has been demonstrated to predict SOC and TC, often with better accuracy than VNIR derived models. For example, McDowell et al. (2012a) predicted TC from VNIR and MIR spectral data on Hawaiian soils that are highly diverse and complex in terms of their mineralogy and soil texture. In the study by McDowell et al. (2012b) it was demonstrated that TC predictions derived from VNIR and from MIR spectra yielded robust results with R2 values of 0.94 or greater, root mean square prediction error (RMSE) values ranged from 2.28%–3.08%, and residual prediction deviation (RPD) of 3.38 or more. In their study the models derived from VNIR and MIR spectra were contrasted, with both spectral regions not being combined to form single spectra. Other studies, such as by McCarty et al. (2002), Reeves et al. (2006), and Reeves (2010) have demonstrated the success of MIR to predict soil C. The majority of these studies have focused on measuring only total soil C, with only a few evaluating individual soil C pools. Yet these pools allow a more detailed picture of ecosystem processes, specifically decomposition and accretion of C in soils. Thus, pools (i.e., chemically extracted soil C fractions) have unique capabilities to reflect the evolution of soil C as a result of various imposed stressors (e.g., climate and land use change). According to Belay-Tedla et al. (2009) C and nitrogen (N) fluxes are largely controlled by the small but highly bio-reactive, labile pools of these elements in terrestrial soils, while long-term C and N storage is determined by the long-lived recalcitrant fractions. In their study they found that recalcitrant and total C and N contents were not significantly affected by warming, however, labile and microbial biomass C fractions showed significant interactions with warming. Labile C, although often a small fraction of SOC, significantly affects heterotrophic respiration at short timescales because of its rapid decomposition (Gu et al., 2004). In a study by Bradford et al. (2007) it was shown that there are significant interactive effects when the formation and decomposition responses of different soil C fractions are considered. Hence, the total amount of soil C stocks was dependent on the rate of labile soil C, N and phosphorus (P) inputs to soils. Labile soil C has also been shown to respond to effects in management (Biederbeck et al., 1994). Proximal sensing technology has the potential to infer not only on total soil C content, but also on soil C fractions. However, this research is still in its infancy. For example, Vasques et al. (2009) used VNIR spectra to estimate various soil C fractions and mineralizable C in a subtropical region dominated by sand-rich soils. Also Sarkhot et al. (2011) estimated SOC, inorganic C, and labile C in a small floodplain area on clay-rich soils in Texas. More work needs to still be done in the field of chemometric modelling of C fractions encompassing diverse soil types to determine whether spectral modelling would be a robust method for predicting of soil C fractions. Recently, Bellon-Maurel and McBratney (2011) critically reviewed both NIR and MIR spectroscopic techniques for soil C studies. Their review highlighted that MIR spectroscopy yielded better (10–40% improvement) models than models developed from NIR spectra. Of the studies reviewed none of them compared NIR and MIR spectroscopy on the same soil samples indicating a critical research gap which is addressed in this study. From the studies they reviewed, they applied either multiple linear regression (MLR) or partial least squares regression (PLSR) for their chemometric models. The simpler MLR models were shown to perform comparably to the PLSR models; however, both methods showed shortcomings in terms of wavelengths selected. Judicious selection of wavelengths is imperative when applying a model to new areas where the soil C range might be similar, but the soil characteristics differ from those where the model was built. Wavelength selection is likely to be an important contributor to increasing the bias in results when the validation dataset is acquired from an “out of field” source. Both methods were found to be biased when applied to new soil samples with different compositions. Recently, the use of non-linear modelling methods, such as ensemble tree regression modelling, has been shown to improve model performances in terms of output error metrics (Vasques et al., 2009, 2010; Viscarra Rossel and Behrens, 2010; Vohland et al., 2011). The focus of this study was to determine the potential of both the VNIR and MIR spectral regions to estimate soil C fractions. Partial least squares regression (PLSR) and random forest (RF) regression models were derived to estimate soil C fractions. These methods were evaluated firstly to determine which method could best predict soil C fractions, and secondly it was determined if either method produced output that is attributed to the physics of soil C properties. We assume that if the most important model factors (i.e., the importance factors for RF and loadings for PLSR) can be attributed to physical soil C properties the models would be robust if generalised and applied to other areas in the future. The soil C fractions focussed on in this study include total C (TC), soil organic C (SOC), recalcitrant C (RC), and hydrolysable C (HC). VNIR and MIR chemometric models were derived from a wide selection of soil samples collected over the state of Florida, USA. The study objectives were to determine the predictive capabilities of the VNIR, MIR and combined VNIR–MIR spectral datasets for predicting soil C fractions (TC, SOC, RC, and HC). Within this overall objective we evaluated the following sub-objectives: 1. Determine which spectral region performs best. 2. Evaluate which modelling method (PLSR or RF) performs best based on validation datasets. 3. Identify if there is a best method in terms of data preparation for C fraction chemometric analysis. 2. Methods 2.1. Field sampling A total of 1014 sample sites were located over the state of Florida using a stratified random sampling design. The strata were defined based upon soil suborder data obtained from the Soil Data Mart — Soil Survey Geographic Database (SSURGO) (Natural Resources Conservation Service — http://soildatamart.nrcs.usda.gov/), and a reclassified land use/cover map derived from the 2003 Florida Vegetation and Land Cover Data map prepared by the Florida Fish and Wildlife Conservation Commission, Tallahassee, Florida. Dominant soils in the state are Spodosols (32%), Entisols (22%), Ultisols (19%), Alfisols (13%), Histosols (11%), and Mollisols and Inceptisols together occupy b3% of the state (Vasques et al., 2010). The soil samples were collected over a period of 1.25 years between 2008 and 2009. Sample sites were located using a differential global positioning system. At each site four 20 × 5.8 cm surface soil cores were collected from within a 2 m diameter area. These four cores were bulked and cooled in the field and then transported and processed in the lab. In addition to the sampled soils, data was collected of the soil suborder and land use practice found at each site. 2.2. Laboratory methods The bulk samples were air-dried and sieved to retrieve the fine earth fraction (b 2 mm), thoroughly mixed and then stored. Subsamples were taken from these dried, sieved and mixed bulk samples for the laboratory analysis. A portion of these subsamples were ball milled for use in some of the laboratory procedures outlined below. N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 2.2.1. Chemical analysis Chemical analysis for deriving the respective C fractions was performed on all the soil samples and included a 5% replication. The C fractions were measured using a Shimadzu TOC-VCPN catalytic combustion oxidation instrument with a SSM-5000a solid sample module (Shimadzu Scientific Instruments, Kyoto, Japan), following their different pre-processing. Total C (TC) was measured on 80–700 mg ball milled samples combusted at 900 °C. Inorganic C (IC) was derived by CO2 evolution from a reaction of 20–250 mg ball milled samples with 42.5% phosphoric acid at 200 °C. Soil organic C (SOC) was calculated by subtraction of IC from the TC. Hot water extractable ‘labile’ C (hydrolysable carbon — HC) was measured by incubating 4 g of soil in 40 mL (1:10) of double de-ionized water for 16 h at 80 °C. Samples were then filtered to 0.22 μm. The non-hydrolysable ‘recalcitrant’ C (RC) was measured by digesting 2 g of the ball milled soil in 10 mL of 5 M HCL under reflux conditions for 16 h. The soil digest was washed 3 times by centrifuge, dried and the remaining undigested C was then combusted at 900 °C. A Spearman's correlation was run to compare the relationships between the different C fractions. This was used as a preliminary comparison to determine the potential similarity in the derived chemometric models. 2.2.2. Spectral data acquisition and pre-model processing 2.2.2.1. VNIR. Soil samples were scanned with an Analytical Spectral Devices Lab Spec 5000 (Analytical Spectral Devices, Boulder, CO). The spectral range of the instrument covers 350 to 2500 nm, with a spectral resolution of 3 nm in the VNIR detector, and 10 nm in the two short wave infrared (SWIR) detector ranges. The sampling range is 1.377 nm in the VNIR range and 2 nm in the SWIR range, producing a spectrum length of 2151 points. The soil samples (b2 mm) were volumetrically controlled (5.5 cm3) in 46 mm petri dishes and oven dried at moderate heat (about 45 °C) for at least 12 h before scanning. The samples were cooled for an hour prior to taking the spectral measurements. Samples were scanned from above at four positions (90° rotations) using a custom designed rotating sample stage. A Spectralon (Labsphere, North Sutton, NH) white reference panel measurement was taken, followed by 25 scans in one of the four directions, the sample was then rotated, and the same procedure followed. After all four directions had been sampled internally the 100 measured spectra were then averaged to produce a single spectrum per sample. The calibrated Spectralon reference panel allows for the measured radiance measurements to be converted into relative reflectance values. 2.2.2.2. MIR. Mid-infrared DRS spectra were collected from ball milled subsamples using a Scimitar 2000 FTIR spectrometer (Varian, Inc., now Agilent Technologies, Santa Clara, CA) outfitted with a diffuse reflectance infrared Fourier transform (DRIFT) accessory. The instrument measures reflectance over the range of 400–6000 cm− 1 (1666– 25,000 nm) with a sampling interval of 2 cm−1 and spectral resolution of 4 cm−1. Each spectrum was the averaged result of 32 co-added scans. KBr powder in the first of the eight sample holders provided a background spectrum that was subtracted from each scan to correct for atmosphere and instrument effects. A new background spectrum was collected at the beginning of a session and between every seven samples. The KBr correction highlighted two narrow regions of the spectrum that were affected by fluctuating atmospheric conditions and exhibited residual atmospheric features. These regions fell between 1350–1419 cm−1 and 2281–2449 cm−1 and were therefore excluded from further analysis and interpretation. 2.2.2.3. Outlier detection. Outlier detection was performed on both the VNIR and MIR spectra separately prior to modelling. The assumption made in the outlier detection was that a spectrum should resemble another spectrum taken within the same soil suborder. Spectra collected 231 within the VNIR spectral range were grouped into their respective soil suborders. For each soil suborder a mean spectrum ±2 standard deviations from this mean was calculated. Each individual spectrum of the respective soil suborder was then compared against the spectral range defined for the suborder. If greater than 75% of this spectrum fell outside the entire spectral range it was considered to be an outlier and was excluded from further modelling. The same outlier detection process was applied to the MIR spectra. 2.2.2.4. VNIR & MIR fusion. Prior to data fusion the MIR spectral range measured in wave numbers (cm− 1), was converted into nanometre (nm) wavelength units (nm = 10,000,000/wave number). The first stage of fusing the spectral data involved matching the respective sample sites that contained spectra from both spectral regions, bearing in mind that some samples were removed during outlier detection. The VNIR and MIR spectral ranges overlap between 1666–2500 nm (SWIR region), providing the opportunity to compare the spectral behaviour between these two instruments. Across this overlapping spectral range it was found that the MIR spectra were on average between 10 and 30% brighter (i.e., higher reflectance value) than those recorded in the VNIR range. This difference therefore had to be accounted for when fusing the two spectra. Typically reflectance values of soil spectra fall below 10% in the MIR spectral region between 5000 and 7500 nm (2000–1330 cm−1), it was decided to adjust the VNIR range upwards, thus preventing any region of the fused spectrum falling below a reflectance value of 10%. Per paired spectrum (i.e., VNIR + MIR spectrum for a sample site) the mean difference across the overlapping region was calculated, and this mean difference value was then added to the VNIR spectrum, thus adjusting these VNIR spectra upwards. Spectra acquired in the SWIR region from the VNIR-DRS instrument had a higher signal to noise ratio compared to the spectrum acquired from the MIR-DRS instrument in this region of the spectrum, as a result of this region falling into the edge of the detector limits. In fusing the adjusted VNIR and the MIR data, it was decided to use the entire VNIR-DRS spectrum, and fuse it to the remaining MIR spectrum from 2500 to 25,000 nm. The end results of this spectral fusion process were integrated spectra covering the spectral range from 350 to 25,000 nm (Fig. 1). To reduce the modelling processing time applied to these fused spectra the spectra were resampled to 5 nm increments across the spectral range. 2.3. Carbon fraction modelling The three spectral datasets were used to create regression models for the prediction of soil C fractions. In order to test the predictive capabilities of the developed models the datasets were split into a calibration/validation (70/30%) set. The data split was based upon a stratified random selection of samples within each of the field sampling strata. Samples were assigned to the calibration/validation split prior to the outlier detection and VNIR & MIR data fusion. Following the removal of samples from the outlier detection and VNIR & MIR data fusion the calibration dataset contained 696 samples and the validation set contained 296 samples. All C fraction values were log transformed prior to modelling. The presented results are based on the respective log transformed C values. Two different chemometric modelling methods were applied in the spectral analysis to test their abilities in modelling and predicting different soil C fractions. Partial least squares regression, a linear modelling technique frequently applied in chemometric analysis (Geladi and Kowalski, 1986; Mevik and Wehrens, 2007; Wehrens, 2011), and RF ensemble tree regression (Breiman, 2001; Seni and Elder, 2010; Wehrens, 2011) a newer non-linear modelling technique, were both applied. Background noise is inherent within spectral data. The effects of noise filtering and preprocessing methods was tested following the application of a variety of different background scatter removal processing techniques, prior to modelling. Nine common scatter correction methods were applied (Wehrens, 2011) and compared to reflectance 232 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Fig. 1. Mean combined VNIR–MIR spectrum for each of the soil orders collected in this study. spectra that had only been averaged with no further processing applied (RAW). The methods included: multiplicative scatter correction (MSC), MSC followed by a 5 window smoothing Savitzky–Golay filter applied to 1st derivative spectra (MSC-1st D), standard normal variate (SNV) correction, 5 window smoothing Savitzky–Golay filter (SG), 5 window quadratic smoothing Savitzky–Golay filter (SG-Quad), 1st derivative-5 window smoothing Savitzky–Golay filter (SG-1st D), 1st derivative-5 window quadratic smoothing Savitzky–Golay filter (SG-1st D-Quad), absorbance (log 10(1/x), where x is the reflectance value), and transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter (log 10 (1/x) SG-1st D). Prior to regression model implementation the spectral data were mean centred. The regression modelling was implemented in R (R Development Core Team, 2010), using the packages “pls” (Mevik and Wehrens, 2007) for the PLSR modelling and the package “randomForest” (Liaw and Wiener, 2002) for the RF modelling. For the PLSR modelling, a maximum limit on latent variables to be used for the models was set at 20. For each developed model the number of latent variables (NLV) selected was based upon the selection of the first minima defined in the prediction residual error sum of squares (PRESS) or if no local minima occurred then the LV which produced the flattest slope between each PRESS value was selected (Mevik and Cederkvist, 2004). In addition to validation using the external validation dataset, an internal k-fold cross validation was performed using a random split into 15 segments. For RF modelling, following a pre-test (data not shown) it was verified that with 400 splits for the maximum of 500 trees in the forest produced comparable results to the default split value calculated as number of parameters/3 (Seni and Elder, 2010). Reduction in the number of splits resulted in a substantial reduction in processing time of the RF models. The developed models were then applied to the independent validation dataset and the predictive powers of the models were assessed in terms of the produced R2, root mean square error (RMSE), and residual prediction deviation (RPD) values (Williams, 1987). The RPD is defined as RPD = SD/SEP, where SD is the standard deviation of the dataset and SEP is the standard error of prediction. For each C fraction, spectral range and modelling method the model that reduced the error metrics the most was selected. The variable loadings/importance values were then evaluated and are discussed with respect to whether the selected model outcomes are based upon physical properties related to C and what effect this might have in implementing these models to new datasets from different regions. 3. Results and discussion 3.1. Laboratory analysis The summary statistics of the C fractions for the samples used in the spectral modelling are presented in Table 1. In terms of C variability, excessively well drained quartz-dominated, marine sands have low biomass production rates relative to mineralization rates and hence are low in C. Marshes and wetland areas, on the other hand, are high in C due to high productivity and anaerobic conditions (Vasques et al., 2012). The soils in this region cover a wide range of the attribute feature space with TC values ranging from about 1 to 523 g·kg−1 making this a unique testing ground for this study (Table 1). In comparison with previous spectral studies conducted in different areas the range of total and organic C is very broad (Janik et al., 1998; McCarty et al., 2002; Vasques et al., 2008). The samples taken across Florida contain very low inorganic C content, resulting in similar content levels in the total and organic C fractions (Table 1). As a result of only a few of the samples containing IC, this C pool was omitted from the study. The analysis shows that the bulk of the C within Florida is concentrated within the stable recalcitrant pool, as measured by RC, compared to the low concentrations of C found in the labile C pool, as measured by HC. Nevertheless, the labile carbon pool is critical because it represents the dynamic component of the C cycle and measurements of HC can indicate effects due to land use, climate, moisture or other landscape changes more rapidly than measurements of TC or SOC (Bradford et al., 2007; Hu et al., 1997). The correlation analysis (Table 2) between the different C fractions shows that in Florida TC, SOC and RC levels in the soil are highly Table 1 Summary statistics for laboratory analysis derived carbon fractions divided into the calibration and validation sets. All fractions are presented in g kg−1. Carbon fraction Total C (TC) Organic C (SOC) Recalcitrant C (RC) Hydrolysable C (HC) Calibration (n = 696) Validation (n = 296) Min Max Mean Median Min Max Mean Median 1.33 1.33 0.67 523.27 523.27 502.07 32.00 31.54 21.13 11.65 11.48 6.18 3.43 3.43 1.93 430.64 430.64 375.70 37.94 37.21 26.87 11.77 11.56 5.83 0.05 19.24 0.88 0.53 0.11 9.30 0.91 0.52 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Table 2 Spearman's correlation analysis of the paired soil carbon fraction soil samples acquired over the state of Florida. Carbon fraction TC SOC RC Total C (TC) Organic C (SOC) Recalcitrant C (RC) Hydrolysable C (HC) HC 0.99 0.96 0.87 0.97 0.87 0.84 correlated, and HC is less related to these three other fractions. Based on this we expect that the chemometric models derived from TC, SOC and RC will be similar and that there will be considerable differences in the models derived for HC. Pre-processing had a greater impact on the modelling approach applied rather than the spectral region being considered. For PLSR modelling the RAW spectral data produced validation error metrics that were only slightly poorer than the best performing pre-processing method. This was irrespective of the carbon fraction or spectral region being considered (Table 3). In models developed using RF regression spectral preprocessing improved the predictive capability of models by an order of Table 3 Partial least square validation error metrics produced using the different spectral pre-processing methods applied to the spectral data from the VNIR, MIR and combined VNIR–MIR spectral regions of four log transformed soil carbon fractions (log g·kg−1). The highlighted cells are the methods that minimised the error metrics and refer to the models that were selected for the model comparisons that are discussed and highlighted in Figs. 1, 3 and 5, and Table 4. TC SOC RC HC Pre-processinga RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D 2–8% (Table 4). For RF modelling it appears that a minimum application of smoothing on 1st derivative spectra will improve model performance (Table 4). From this comparative study no single “best” pre-processing method can be suggested for C fraction modelling using spectral processing. It is shown however, that MSC and SNV in general perform poorly (Tables 3 and 4), this contrasts from other studies that have shown these two pre-processing methods to improve their models (Cambule et al., 2012; McDowell et al., 2012a; Vasques et al., 2008). Given the similarity in concentrations of the TC and SOC fractions (only about 10% of the samples were different due to measurable IC) it is not surprising to see that the validation outcomes with respect to pre-processing are very similar for both the PLSR and RF modelling procedures (Tables 3 and 4). 3.3. Carbon fraction model evaluations 3.2. Pre-processing C Fractions 233 R2 0.86 0.76 0.79 0.78 0.86 0.86 0.86 0.86 0.81 0.61 0.85 0.75 0.81 0.78 0.85 0.85 0.85 0.87 0.82 0.58 0.82 0.72 0.75 0.77 0.82 0.81 0.82 0.83 0.85 0.65 0.81 0.73 0.72 0.74 0.81 0.81 0.8 0.8 0.82 0.68 VNIR RMSE 0.41 0.53 0.50 0.51 0.41 0.41 0.41 0.41 0.47 0.68 0.42 0.55 0.48 0.52 0.42 0.43 0.42 0.40 0.46 0.70 0.52 0.65 0.61 0.59 0.52 0.53 0.52 0.51 0.48 0.73 0.36 0.42 0.43 0.41 0.36 0.36 0.36 0.36 0.35 0.46 RPD 2.6 2.0 2.2 2.1 2.6 2.6 2.6 2.6 2.3 1.6 2.6 2.0 2.3 2.1 2.6 2.5 2.6 2.7 2.3 1.5 2.3 1.8 2.0 2.1 2.3 2.3 2.3 2.4 2.5 1.7 2.3 1.9 1.9 1.9 2.3 2.3 2.2 2.2 2.3 1.7 R2 0.95 0.91 0.87 0.93 0.95 0.95 0.94 0.94 0.95 0.94 0.95 0.91 0.87 0.93 0.95 0.95 0.94 0.93 0.96 0.94 0.94 0.91 0.86 0.93 0.94 0.94 0.92 0.92 0.94 0.92 0.86 0.84 0.74 0.84 0.86 0.86 0.79 0.78 0.86 0.80 MIR RMSE 0.24 0.33 0.41 0.29 0.24 0.24 0.27 0.27 0.24 0.27 0.24 0.33 0.41 0.28 0.24 0.24 0.27 0.28 0.23 0.27 0.30 0.36 0.45 0.32 0.31 0.30 0.35 0.35 0.30 0.33 0.31 0.33 0.42 0.33 0.31 0.31 0.38 0.38 0.30 0.37 RPD 4.5 3.3 2.6 3.8 4.5 4.5 4.0 4.0 4.6 4.0 4.5 3.3 2.6 3.8 4.5 4.5 4.0 3.9 4.7 4.0 4.0 3.4 2.6 3.7 3.9 4.0 3.4 3.4 4.0 3.6 2.6 2.4 1.9 2.4 2.6 2.6 2.1 2.1 2.6 2.2 R2 0.95 0.90 0.89 0.92 0.95 0.95 0.95 0.95 0.95 0.95 0.95 0.91 0.88 0.92 0.95 0.95 0.95 0.95 0.95 0.95 0.93 0.90 0.89 0.91 0.93 0.93 0.93 0.93 0.93 0.92 0.86 0.83 0.80 0.84 0.86 0.86 0.86 0.86 0.86 0.81 VNIR-MIR RMSE 0.24 0.35 0.37 0.31 0.24 0.23 0.24 0.24 0.25 0.25 0.24 0.33 0.38 0.31 0.23 0.23 0.23 0.23 0.23 0.24 0.31 0.38 0.39 0.36 0.31 0.32 0.31 0.31 0.31 0.33 0.30 0.33 0.36 0.33 0.30 0.30 0.30 0.30 0.30 0.35 RPD 4.5 3.0 2.9 3.5 4.5 4.6 4.5 4.5 4.3 4.3 4.5 3.2 2.8 3.4 4.6 4.6 4.6 4.6 4.7 4.4 3.9 3.1 3.0 3.3 3.9 3.7 3.8 3.9 3.8 3.6 2.7 2.4 2.2 2.4 2.7 2.7 2.7 2.7 2.6 2.3 a The different pre-processing techniques applied: RAW = averaged; MSC = multiplicative scatter correction; MSC-1st D = MSC followed by a 5 window smoothing Savitzky–Golay filter applied to 1st derivative spectra; SNV = standard normal variate correction, SG = 5 window smoothing Savitzky–Golay filter; SG-Quad = 5 window quadratic smoothing Savitzky–Golay filter; SG-1st D = 1st derivative-5 window smoothing Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing Savitzky–Golay filter; log 10(1/x) = absorbance (where x is the reflectance value); log 10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter. We have shown that the various C fractions measured in this study can be predicted with high accuracy through the application of chemometric models. The best prediction models were capable of explaining up to 86% of the variation in the labile (HC) pool and as much as 94– 96% of the variation in the TC, SOC and RC pools (Tables 5 and 6). For the dataset studied we found that when comparing PLSR and RF as techniques for modelling C fractions, both methods produce comparable error metrics in all the spectral ranges studied, which is similar to the findings of McDowell et al. (2012a). When evaluating the error metrics in terms of the spectral ranges in their ability to model the separate C fractions we find that the MIR and VNIR–MIR spectral ranges produced C fraction models of higher predictive capabilities than models produced using only the VNIR range. This is in agreement with the findings of Reeves (2010), though the differences between the two spectral ranges were closer than he found. Combining the VNIR and MIR spectral range however, does not dramatically improve models when compared to using only the MIR spectral range (Tables 3/5 and 4/6). The cross-validation error metrics shown in both the PLSR and RF selected models (Tables 5 and 6) are lower than the validation (and calibration metrics for the PLSR models) metrics. Inherent in the RF modelling process is the application of an “Out of Bag” (OOB) crossvalidation technique similar to the internal k-fold cross validation that was performed in the PLSR modelling procedure. This internal crossvalidation method has been documented to produce on average lower error metrics for the test models (Wehrens, 2011), but produces robust output models, which are shown by the comparable results of the calibration and validation sets in the RF models. These comparable models indicate that they are not overfitted. This is further verified by the similarity in the validation error metrics produced by both the PLSR and RF models. The error metrics produced by the RF and PLSR modelling show that both methods produced models with similar predictive capabilities. Analysis of the spectral loadings or importance values generated during the RF and PLSR modelling showed that these methods highlight and accentuate different spectral properties and wavelengths in each of the spectral regions. Higher loading and the importance values, indicate variables that contribute more weight to the prediction models. A consistent finding, irrespective of the spectral range considered, is that the RF models have only a few specific wavelengths across the respective spectral ranges that are of high importance to the models. The important wavelengths identified for the different C fractions tend to coincide in the RF models; however, the relative intensity of these wavelengths differs. The PLSR wavelength loadings, in contrast, tend to be broader, and for all the spectral regions TC and SOC models are similar. In contrast, HC and RC show different wavelength loading properties; some were similar to the TC and SOC, and others were substantially different. Below we highlight the model spectral properties in each of the analysed spectral regions with respect to the modelling method applied and the C fraction modelled. 234 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Table 4 Random forest validation error metrics produced using the different spectral pre-processing methods applied to the spectral data from the VNIR, MIR and combined VNIR_MIR spectral regions of four log transformed soil carbon fractions (log g·kg−1). The highlighted cells are the methods that minimised the error metrics and refer to the models that were selected for the model comparisons that are discussed and highlighted in Figs. 2, 4 and 6, and Table 5. C Fractions TC SOC RC HC Pre-processinga RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG–Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D RAW MSC MSC-1st D SNV SG SG –Quad SG–1st D SG–1st D–Quad Log 10 (1/x) Log 10 (1/x), SG–1st D R2 0.85 0.60 0.80 0.63 0.85 0.85 0.88 0.88 0.85 0.80 0.86 0.60 0.80 0.63 0.86 0.86 0.88 0.88 0.86 0.80 0.83 0.58 0.78 0.60 0.83 0.83 0.87 0.87 0.82 0.77 0.75 0.57 0.73 0.58 0.74 0.74 0.80 0.80 0.74 0.74 VNIR RMSE 0.42 0.70 0.51 0.69 0.42 0.42 0.39 0.39 0.42 0.49 0.41 0.70 0.51 0.68 0.41 0.41 0.38 0.38 0.41 0.49 0.51 0.80 0.60 0.79 0.51 0.51 0.46 0.46 0.52 0.58 0.41 0.53 0.42 0.52 0.41 0.41 0.36 0.36 0.41 0.41 RPD 2.6 1.5 2.1 1.6 2.6 2.6 2.8 2.8 2.6 2.2 2.7 1.5 2.1 1.6 2.6 2.6 2.8 2.8 2.6 2.2 2.4 1.5 2.0 1.5 2.4 2.4 2.6 2.6 2.3 2.1 2.0 1.5 1.9 1.5 2.0 2.0 2.2 2.2 2.0 2.0 R2 0.92 0.93 0.92 0.93 0.92 0.92 0.94 0.94 0.92 0.93 0.92 0.93 0.92 0.93 0.92 0.92 0.94 0.94 0.92 0.93 0.91 0.92 0.92 0.92 0.91 0.91 0.93 0.93 0.91 0.93 0.77 0.80 0.79 0.79 0.77 0.77 0.81 0.81 0.78 0.81 MIR RMSE 0.31 0.28 0.31 0.28 0.30 0.31 0.28 0.28 0.31 0.29 0.30 0.29 0.31 0.28 0.30 0.30 0.28 0.28 0.30 0.29 0.37 0.35 0.35 0.35 0.36 0.37 0.34 0.34 0.38 0.33 0.39 0.36 0.37 0.37 0.38 0.39 0.35 0.35 0.38 0.36 RPD 3.5 3.8 3.5 3.8 3.6 3.5 3.9 3.9 3.5 3.8 3.6 3.8 3.5 3.8 3.5 3.6 3.9 3.9 3.6 3.7 3.3 3.5 3.4 3.5 3.3 3.2 3.6 3.6 3.2 3.6 2.1 2.2 2.2 2.2 2.1 2.1 2.3 2.3 2.1 2.3 R2 0.92 0.92 0.94 0.93 0.91 0.92 0.95 0.95 0.92 0.95 0.92 0.92 0.94 0.92 0.92 0.92 0.94 0.94 0.92 0.95 0.91 0.91 0.93 0.91 0.91 0.91 0.93 0.93 0.91 0.93 0.77 0.80 0.85 0.80 0.77 0.77 0.85 0.85 0.77 0.85 VNIR-MIR RMSE RPD 0.31 3.5 0.3 3.6 0.26 4.2 0.29 3.6 0.31 3.4 0.31 3.5 0.25 4.2 0.25 4.2 0.31 3.5 0.25 4.3 0.30 3.5 0.30 3.6 0.27 4.0 0.29 3.6 0.31 3.5 0.30 3.5 0.26 4.1 0.26 4.1 0.30 3.5 0.25 4.2 0.37 3.3 0.35 3.4 0.33 3.6 0.36 3.3 0.36 3.3 0.36 3.3 0.32 3.7 0.32 3.8 0.37 3.2 0.32 3.7 0.38 2.1 0.36 2.2 0.32 2.5 0.36 2.2 0.39 2.1 0.39 2.1 0.31 2.6 0.31 2.6 0.39 2.1 0.31 2.6 a The different pre-processing techniques applied: RAW = averaged; MSC = multiplicative scatter correction; MSC-1st D = MSC followed by a 5 window smoothing Savitzky-Golay filter applied to 1st derivative spectra; SNV = standard normal variate correction, SG = 5 window smoothing Savitzky–Golay filter; SG-Quad = 5 window quadratic smoothing Savitzky–Golay filter; SG-1st D = 1st derivative-5 window smoothing Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing Savitzky–Golay filter; log 10(1/x) = absorbance (where x is the reflectance value); log 10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter. 3.3.1. VNIR The behaviour of the TC and SOC wavelength loadings is distinctly different from that of RC and HC in the PLSR-VNIR models (Fig. 2). The RC and HC pools show broad wavelength loadings that coincide with the border between the O\H and C\H combination stretches, and a peak associated with the water absorption feature between 1850 and Table 5 Calibration, cross-validation and validation error metrics for the selected partial least square models (the models highlighted in Table 2) derived from the three spectral datasets modelled to derive and predict soil log transformed carbon fractions (log g·kg− 1) across Florida. Spectrum VNIR MIR VNIR– MIR C fraction Calibration R2 RMSE RPD R2 Cross-validation RMSE RPD R2 RMSE RPD TC SOC RC HC TC SOC RC HC TC SOC RC HC 0.85 0.85 0.80 0.81 0.96 0.95 0.91 0.84 0.96 0.96 0.91 0.88 0.40 0.39 0.49 0.36 0.21 0.23 0.34 0.33 0.20 0.21 0.33 0.29 0.44 0.45 0.52 0.39 0.25 0.26 0.36 0.36 0.23 0.24 0.36 0.32 0.41 0.40 0.48 0.35 0.24 0.23 0.30 0.30 0.23 0.23 0.31 0.30 2.6 2.6 2.3 2.3 4.3 4.4 3.3 2.5 5.0 4.8 3.4 2.9 0.81 0.80 0.78 0.78 0.94 0.94 0.90 0.81 0.95 0.94 0.90 0.85 2.3 2.2 2.1 2.1 4.0 4.0 3.1 2.3 4.4 4.2 3.1 2.5 Validation 0.86 0.87 0.85 0.82 0.95 0.96 0.94 0.86 0.95 0.95 0.93 0.86 2.6 2.7 2.5 2.3 4.6 4.7 4.0 2.6 4.6 4.7 3.9 2.7 1920 nm. In contrast, TC and SOC have a number of narrower peaks that coincide with known absorption features. A prominent peak at 1000 nm although falling within the broad N\H/O\H 2nd overtone stretch feature has also been associated with a small narrow Si\O stretch feature (Manchanda et al., 2002). Similar to the RC and HC wavelength loadings, prominent peaks occur at the border between the O\H and C\H combination stretch and the water feature between 1850 and 1920 nm. A third peak occurs in the overlapping zone between the C\H combination stretch and the N\H/O\H 1st overtone stretch. The peak that occurs at 1850 nm is right on the edge of the C\H 1st overtone stretch. Although no artefacts were observed in the collected spectra, both 1000 nm and 1850 nm wavelengths coincide with the overlap regions of the spectral detectors in the ASD instrument, and may therefore be artefacts caused by the spectroradiometer. The RF C fraction models within the VNIR range contrast significantly to the PLSR models (Fig. 3 vs 2). Unlike the PLS loadings the important wavelengths identified in the RF models form narrow peaks, rather than the broader peaks observed in the PLS loading plots. In the RF models, for TC, SOC and RC, the most prominent peak occurs in the orange region of the visible spectrum (590–620 nm), with some smaller peaks in the yellow region (570–590 nm). For these three fractions some smaller peaks are linked to N\H/O\H/C\H combination stretches between 2000 and 2500 nm. Hydrolysable C shares peak locations with the other three fractions, but unlike the other fractions a peak in the N\H/O\H combined stretch region has greater importance than the contribution of visible spectrum wavelengths. McDowell et al. (2012a) also found similar peaks in their RF models above 2000 nm for TC, however their strongest peaks were in this region rather than in the visible spectral region as we found in this study. 3.3.2. MIR The loading plots produced by the PLS-MIR models are noisy across the entire MIR spectral region (Fig. 4). Unlike the models created in the VNIR and VNIR–MIR spectral regions the TC and SOC models are different across most of the entire spectral range, however these two models do share the locations of the principal loading peaks. In the automated selection of the number of latent variables (NLV) for these four models 18, 15, 15 and 15 NLV were selected for TC, SOC, RC and HC, respectively, the three additional variables selected in the TC model result in the differences observed between the TC and SOC. Similar to the results seen with the VNIR spectral range, all C fraction loading peaks occur in the spectral range associated with O\H bonds, in this instance with the O\H fundamental bond vibrations. The peak observed at 7500 nm for all C fractions has not been associated with any soil bond vibrations, and needs to be investigated further to determine if this is a shift of C\O stretch feature (although a second smaller feature does occur within the C\O stretch region), or in fact relates to another property. The multiple smaller loading peaks across the spectral range between 12,000 and 25,000 nm are of interest in the PLSR models. This region is associated with Si\O bond vibrational stretches and would tie in with the high quartz content found in the majority of the Florida soil samples. The variable importance plots of the four C fractions produced by RF modelling again shows completely different behaviour to that seen in the PLSR modelling. Only a small component of the MIR spectral range contributes to modelling the four C fractions (Fig. 5). In contrast to the PLSR models there are no important variables associated with the Si\O bond features, and the important variables are associated directly with C bond features. Similar to what was observed in the VNIR spectral range, HC does not have as many wavelength-variable peaks that contribute to its model. In the MIR range the HC importance features are dominated by 3 narrow peaks linked to C_O stretching. Though these are the dominant variables linked to TC, SOC and RC, they also include features in the C`C absorption region. In the C`C region RC has only a single peak, compared to the two peaks displayed by TC and SOC. The locations of the peaks in the region of 5500 nm are similar to N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Table 6 Cross-validation and validation error metrics for the selected Random Forest models (the models highlighted in Table 3) derived from the three spectral datasets modelled to derive and predict soil log transformed carbon fractions (log g·kg−1) across Florida. In addition based on the random forest analysis of VNIR–MIR it was identified that the region 2000–6000 nm could be used to model the different C fractions, the final row provides the validation findings for the different carbon fractions run on this subset of the combined spectrum. Spectrum C fraction Cross-validation R2 VNIR MIR VNIR–MIR 2000–6000 nm TC SOC RC HC TC SOC RC HC TC SOC RC HC TC SOC RC HC RMSE 0.84 0.41 0.84 0.41 0.79 0.51 0.79 0.38 0.92 0.30 0.91 0.31 0.88 0.39 0.77 0.41 0.92 0.29 0.92 0.29 0.89 0.37 0.84 0.34 Log 10 (1/x) SG-1st Da Log 10 (1/x) SG-1st Da SG-1st Da SG-1st D-Quada Validation RPD R2 RMSE RPD 2.4 2.4 2.2 2.1 3.4 3.3 2.8 2.0 3.5 3.5 3.0 2.4 0.88 0.88 0.87 0.80 0.94 0.94 0.93 0.81 0.95 0.95 0.93 0.85 0.95 0.94 0.93 0.85 0.39 0.38 0.46 0.36 0.28 0.28 0.33 0.35 0.25 0.25 0.32 0.31 0.25 0.27 0.33 0.31 2.8 2.8 2.6 2.2 3.9 3.9 3.6 2.3 4.3 4.2 3.8 2.6 4.2 4.0 3.6 2.6 a The pre-processing steps providing the highest calibration and validation output for the 2000–6000 nm spectral range: SG-1st D = 1st derivative-5 window smoothing Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing Savitzky–Golay filter; log 10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter. 235 When the VNIR and MIR ranges are combined for the PLSR modelling (Fig. 6), none of the features that were prominent in the PLSR models generated for the VNIR region (Fig. 2) are prominent in the combined spectrum. The combined range loadings are however smoother than those presented in the loading plot produced by using only the MIR (Fig. 4). Comparison of the MIR (Fig. 4) and VNIR–MIR (Fig. 6) loading plots shows similar loading peaks, with the most prominent peak occurring at 2700 nm associated with O\H fundamentals, a double feature in C\O stretch region and the broad double loading peaks in the Si\O stretch regions. A feature between 3950 and 4000 nm has also been related to O\H stretching, and occurs in both the MIR and VNIR–MIR loading plots. In the VNIR–MIR region the automated PLSR NLV selection procedure for the four C fractions resulted in 18, 17, 13 and 18 NLV being selected for the TC, SOC, RC and HC models, respectively. Similar to what is seen in the VNIR and MIR RF importance plots (Figs. 3 and 5), in the VNIR–MIR importance plot only a very small region of the entire spectrum shows variables that were important to the RF modelling of the C fractions (Fig. 7). The MIR and VNIR–MIR RF models shared two prominent features located in the C_O stretch and the C`C stretch region (Figs. 5 and 7). The VNIR–MIR HC model differed from its MIR model by containing peaks in two regions, the one in the same C_O stretch region and the second located in the NIR at a feature associated with a combined O\H/N\H bond stretch (Figs. 5, 6 and 7). Unlike the MIR models the VNIR–MIR models for TC, SOC and RC in addition also display an important variable associated with the edge of the O\H stretch bond feature at 4000 nm (Figs. 5 and 7). 4. Conclusions those seen in McDowell et al. (2012a), though in their study they did not observe peaks within the C`C spectral feature. 3.3.3. VNIR–MIR This is the first study to the authors' knowledge that looks at the impacts on C modelling resulting from fusing the VNIR and MIR spectral regions into a single spectrum. In creating a combined spectrum it becomes possible to see the role of the fundamental versus overtone features in the chemometric models. Both the PLSR and RF models show that the wavelengths contributing most to their respective models are all located above 2000 nm with little contribution from the visible through to near infrared region below 2000 nm. Most chemometric studies of soil C modelling have focused on generating models to estimate total or organic C concentrations. In this study, by contrast, we have shown that chemometric modelling is useful to accurately predict soil C fractions. This approach could contribute towards monitoring C for ecological observatories, C trading, assessing disturbance impacts, as well as assessing the effects of global climate and land use change on soils. Total C, SOC and RC are all highly correlated in Florida soils. We have shown that models generated for TC and SOC are in most instances nearly identical. Although RC is highly correlated to TC and SOC the models generated for RC produce slightly poorer error metrics, and particularly when modelling using PLSR, the RC model loadings differ from those generated in the TC and SOC models. Fig. 2. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the VNIR spectral range for each of the carbon fractions. Highlighted on the plots are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). 236 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Fig. 3. Importance values for the selected random forest models in the VNIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). Deviation of HC in Florida in terms of the correlation, error metrics and output model loadings indicates that this C fraction warrants development of its own individual chemometric models. The two broad aims of this study were firstly to perform a comparison between the VNIR and MIR spectral regions and assess their ability to model and estimate the different fractions of soil C, and secondly to evaluate the performance of two different chemometric modelling methods in modelling and estimating these soil C fractions. Conclusions based on these aims are highlighted below. 4.1. Spectral region comparison for modelling soil carbon fractions We have found that the MIR region produces slightly better results than the VNIR region, which is similar to earlier findings presented in Bellon-Maurel and McBratney (2011). In addition, we have found that in a combined spectrum the region above 2000 nm contributes most to modelling of C fractions. If RF modelling is used it should be possible to model C fractions using only the spectral regions between 2000 and 6000 nm. Based on this finding we ran an RF analysis on our dataset to determine if the spectral region from 2000 to 6000 nm will in fact enable modelling of the different soil carbon fractions with a comparable accuracy to that of the entire combined spectrum. In Table 5 we provide the validation findings from this analysis. This region provided comparable results to using the entire VNIR–MIR combined spectrum and confirms that this spectral region alone can be used to predict soil carbon fractions. Through this spectral fusion approach one is able to balance the decreased signal to noise ratio (SNR) experienced at the edge of both spectroradiometric instruments, and capture the spectral behaviour associated with the fundamental overtone region of the spectrum. Bellon-Maurel and McBratney (2011) concluded through the analysis of multiple studies that models within the VNIR region identified the range of 1650–2500 nm as the best range for C modelling. This study found that using either PLS or RF modelling the region above 2000 nm was most important. We speculate that inclusion of visible spectrum is most likely location/area specific, as it would relate to colour of the soil, and its inclusion in models would reduce their ability to generalise to new areas. The robustness of models in the validations suggests that models could be applied to new soil spectra to predict soil carbon Fig. 4. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the MIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 237 Fig. 5. Importance values for the selected random forest regression models in the MIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). A reduced spectral range is displayed in this plot to highlight the regions where peaks are seen, the remainder of the MIR spectrum is relatively flat. fractions, assuming that soil samples were collected from within the soil-geographic domain covered in this dataset. 4.2. Modelling methods Models derived from PLSR in the MIR and VNIR–MIR range have slightly higher RPD values and slightly lower RMSE values compared to the results obtained from the RF modelling for all C fractions. On the other hand in the VNIR range, with the exception of HC, RF modelling produced slightly better results in terms of higher RPD and lower RMSE values (Tables 5 and 6). A one-dimensional interpretation of these findings would lead one to conclude that PLSR as a modelling method outperforms RF modelling. However, in considering not only the error metrics, but also the loading or importance values that both of these modelling methods produce (Figs. 2–7), evaluation of the modelling method should be reassessed to take into account the physical properties of soils and specifically soil C. The importance value plots generated in the RF modelling approach suggest that this method more specifically targets the physical properties of the soil that are linked to C rather than the general soil properties associated with the sample set. This is particularly evident in the MIR and VNIR–MIR spectral region (Figs. 4–7) where in the PLSR modelling, wavelengths that fall within the “Si\O\Si stretch” (N12,000 nm) spectral region have high relative loading contributions in the produced models. The reduced wavelength specificity seen in the PLSR models would result in the PLSR models being less suited to generalise to areas with differing soil properties, e.g. to areas where soils contain low concentration of quartzite or have higher inorganic C content. In this study PLSR produced models with slightly higher error metrics than those obtained from applying RF modelling. In terms of the modelling procedure, RF has a few advantages over the PLSR modelling approach, these include no need to transform the values (spectral mean centering, and log transformation of the C values) prior to modelling, and the impact on orthogonality amongst the PLS factors (here shown in the impact on the loading plots when different numbers of latent variables were selected in the TC, SOC and RC models). Based on the similarity of error metrics produced using either method and the points Fig. 6. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the VNIR–MIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). 238 N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Fig. 7. Importance values for the selected RF models in the VNIR–MIR spectral range for each of the carbon fractions. Only a subset of spectral region which contains the importance feature peaks is shown, the remainder of the spectral region showed no peaks. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). highlighted above, we recommend the use of RF modelling for chemometric analysis of C fractions. Acknowledgments Funding for this study was provided by the project “Rapid Assessment and Trajectory Modeling of Changes in Soil Carbon across a Southeastern Landscape” (USDA–CSREES–NRI grant award #2007-3510718368; Agricultural and Food Research Initiative — National Institute of Food and Agriculture). We thank for contributions with field work and laboratory analysis: A. Comerford, N.B. Comerford, E. Azuaje, X. Dong, S. Moustafa, C.W. Ross, D. Sarkhot, L. Stanley, A. Stoppe, L. Muller, A. Quidez and X. Xiong. We thank the reviewer for the insightful comments that helped us with improving this manuscript. References Belay-Tedla, A., Zhou, X., Su, B., Wan, S., Luo, Y., 2009. Labile, recalcitrant, and microbial carbon and nitrogen pools of a tallgrass prairie soil in the US Great Plains subjected to experimental warming and clipping. Soil Biol. Biochem. 41, 110–116. Bellamy, P.H., Loveland, P.J., Bradley, R.I., Lark, L.R., Kirk, G., 2005. Carbon losses from all soils across England and Wales 1978–2003. Nature 437, 245–248. Bellon-Maurel, V., McBratney, A., 2011. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils — critical review and research perspectives. Soil Biol. Biochem. 43, 1398–1410. Ben-Dor, E., Irons, J.R., Epema, G.F., 1999. Soil reflectance. In: Rencz, A.N. (Ed.), Remote Sensing for the Earth Sciences: Manual of Remote Sensing volume 3. John Wiley & Sons, Inc., pp. 111–188. Biederbeck, V.O., Janzen, H.H., Campbell, C.A., Zentner, R.P., 1994. Labile soil organic matter as influenced by cropping practices in an arid environment. Soil Biol. Biochem. 26 (12), 1647–1656. Bradford, M.A., Fierer, N., Reynolds, J.F., 2007. Soil carbon stocks in experimental mesocosms are dependent on the rate of labile carbon, nitrogen and phosphorus inputs to soils. Funct. Ecol. 22 (6), 964–974. Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. Brown, D.J., 2007. Using a global VNIR soil-spectral library for local soil characterization and landscape modeling in a 2nd-order Uganda watershed. Geoderma 140, 444–453. Brown, D.J., Shepherd, K.D., Walsh, M.G., Dewayne Mays, M., Reinsch, T.G., 2006. Global soil characterization with VNIR diffuse reflectance spectroscopy. Geoderma 132, 273–290. Cambule, A.H., Rossiter, D.G., Stoorvogel, J.J., Smaling, E.M.A., 2012. Building a near infrared spectral library for soil organic carbon estimation in the Limpopo National Park, Mozambique. Geoderma 183–184, 41–48. Chang, C., Laird, D.A., Mausbach, M.J., Hurburgh Jr., C.R., 2001. Near-infrared reflectance spectroscopy — principal components regression analysis of soil properties. Soil Sci. Soc. Am. J. 65, 480–490. Craine, J.M., Gelderman, T.M., 2011. Soil moisture controls on temperature sensitivity of soil organic carbon decomposition for a mesic grassland. Soil Biol. Biochem. 43, 455–457. Davidson, E.A., Janssens, I.A., 2006. Temperature sensitivity of soil carbon decomposition and feedbacks to climate change. Nature 440, 165–173. Fierer, N., Craine, J.M., McLaughlan, K., Schimel, J.P., 2005. Litter quality and the temperature sensitivity to decomposition. Ecology 86, 320–326. Ge, Y., Morgan, C.L.S., Grunwald, S., Brown, D.J., Sarkhot, D.V., 2011. Comparison of soil reflectance spectra and calibration models obtained using multiple spectrometers. Geoderma 161 (3–4), 202–211. Geladi, P., Kowalski, B.R., 1986. Partial least-squares regression: a tutorial. Anal. Chim. Acta. 185, 1–17. Grunwald, S., Thompson, J.A., Boettinger, J.L., 2011. Digital soil mapping and modeling at continental scales — finding solutions for global issues. Soil Sci. Soc. Am. J. 75 (4), 1201–1213 (SSSA 75th Anniversary Special Paper). Gu, L., Post, W.M., King, A.W., 2004. Fast labile carbon turnover obscures sensitivity of heterotrophic respiration from soil to temperature: a model analysis. Glob. Biogeochem. Cycles 18, GB1022 (1–11). Hu, S., Coleman, D.C., Carroll, C.R., Hendrix, P.F., Beare, M.H., 1997. Labile soil carbon pools in subtropical forest and agricultural ecosystems as influenced by management practices and vegetation types. Agric. Ecosyst. Environ. 65 (1), 69–78. Janik, L., Merry, R., Skjemstad, J., 1998. Can mid infrared diffuse reflectance analysis replace soil extractions? Aust. J. Exp. Agric. 38, 681–696. Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22. Manchanda, M.L., Kudrat, M., Tiwari, A.K., 2002. Soil survey and mapping using remote sensing. Trop. Ecol. 43, 61–74. McCarty, G.W., Reeves III, J.B., Reeves, V.B., Follett, R.F., Kimble, J.M., 2002. Mid-infrared and near-infrared diffuse reflectance spectroscopy for soil carbon measurement. Soil Sci. Soc. Am. J. 66, 640–646. McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., Knox, N.M., 2012a. Soil total carbon analysis in Hawaiian soils with visible, near-infrared and mid-infrared diffuse reflectance spectroscopy. Geoderma 189–190, 312–320. McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., 2012b. Effects of subsetting by carbon content, soil order, and spectral classification on prediction of soil total carbon with diffuse reflectance spectroscopy. J. Appl. Environ. Soil Sci. 2012, 1–14. http://dx. doi.org/10.1155/2012/294121 (Article ID 294121). Mevik, B.-H., Cederkvist, H.R., 2004. Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLSR). J. Chemometr. 18, 422–429. Metz, B., Davidson, O.R., Bosch, P.R., Dave, R., Meyer, L.A. (Eds.), 2007. Contribution of Working Group III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change, 2007. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA. Mevik, B.-H., Wehrens, R., 2007. The pls package: principal component and partial least squares regression in R. J. Stat. Softw. 18 (2), 1–24. R Development Core Team, 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Reeves, J.B., 2010. Near- versus mid-infrared diffuse reflectance spectroscopy for soil analysis emphasizing carbon and laboratory versus on-site analysis: where are we and what needs to be done? Geoderma 158, 3–14. Reeves III, J.B., McCarty, G.W., Mimmo, T., 2002. The potential of diffuse reflectance spectroscopy for the determination of carbon inventories in soils. Environ. Pollut. 116, 277–284. Reeves III, J.B., Follett, R.F., McCarty, G.W., Kimble, J.M., 2006. Can near or mid infrared diffuse reflectance spectroscopy be used to determine soil carbon pools. Commun. Soil Sci. Plant Anal. 37 (15–20), 2307–2325. N.M. Knox et al. / Geoderma 239–240 (2015) 229–239 Sarkhot, D.V., Grunwald, S., Ge, Y., Morgan, C.L.S., 2011. Comparison and detection of soil carbon under Arundo donax and coastal bermuda grass using visible/near infrared diffuse reflectance spectroscopy. Geoderma 164, 22–32. Seni, G., Elder, J.F., 2010. Ensemble methods in data mining: improving accuracy through combining predictions, synthesis. Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool Publishers. Shepherd, K.D., Walsh, M.G., 2002. Development of reflectance spectral libraries for characterization of soil properties. Soil Sci. Soc. Am. J. 66, 988–998. Stuart, B.H., 2004. Infrared Spectroscopy: Fundamentals and Applications. John Wiley & Sons, Inc. Thornley, J.H.M., Cannell, M.G.R., 2001. Soil carbon storage response to temperature: an hypothesis. Ann. Bot. (Lond.) 87 (5), 591–598. Trumbore, S.E., Chadwick, O.A., Amundson, R., 1996. Rapid exchange between soil carbon and atmospheric carbon dioxide driven by temperature change. Science 272, 393–396. Udelhoven, T., Emmerling, C., Jarmer, T., 2003. Quantitative analysis of soil chemical properties with diffuse reflectance spectrometry and partial least-square regression: a feasibility study. Plant Soil 251 (2), 319–329. Vasques, G.M., Grunwald, S., Sickman, J.O., 2008. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma 146, 14–25. 239 Vasques, G.M., Grunwald, S., Sickman, J.O., 2009. Visible/near-infrared spectroscopy modeling of dynamic soil carbon fractions. Soil Sci. Soc. Am. J. 73, 176–184. Vasques, G.M., Grunwald, S., Harris, W.G., 2010. Building a spectral library to estimate soil organic carbon in Florida. J. Environ. Qual. 39, 923–934. Vasques, G.M., Grunwald, S., Myers, D.B., 2012. Multi-scale behavior of soil carbon at nested locations in Florida, USA. Landsc. Ecol. 27, 355–367. http://dx.doi.org/10. 1007/s10980-011-9702-3. Viscarra Rossel, R.A., Behrens, T., 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 158 (1–2), 46–54. Viscarra Rossel, R.A., Walvoort, D.J.J., McBratney, A.B., Janik, L.J., Skjemstad, J.O., 2006. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 131, 59–75. Vohland, M., Besold, J., Hill, J., Fründ, H.-C., 2011. Comparing different multivariate calibration methods for the determination of soil organic carbon pools with visible to near infrared spectroscopy. Geoderma 166 (1), 198–255. Wehrens, R., 2011. Chemometrics with R. In: Gentleman, R., Hornik, K., Pamigiani, G. (Eds.), Use R! Springer-Verlag, Heidelberg. Williams, P.C., 1987. Variables affecting near-infrared reflectance spectroscopic analysis. In: Williams, P., Norris, K. (Eds.), Near-infrared Technology in the Agricultural and Food Industries. American Association of Cereal Chemists, St. Paul, MN, pp. 143–167.