...

Modelling soil carbon fractions with visible near-infrared (VNIR) and ⁎ N.M. Knox

by user

on
Category: Documents
13

views

Report

Comments

Transcript

Modelling soil carbon fractions with visible near-infrared (VNIR) and ⁎ N.M. Knox
Geoderma 239–240 (2015) 229–239
Contents lists available at ScienceDirect
Geoderma
journal homepage: www.elsevier.com/locate/geoderma
Modelling soil carbon fractions with visible near-infrared (VNIR) and
mid-infrared (MIR) spectroscopy
N.M. Knox a,b,c, S. Grunwald c,⁎, M.L. McDowell d,1, G.L. Bruland d,2, D.B. Myers c,3, W.G. Harris c
a
Earth Observation Division, South African National Space Agency (SANSA), PO Box 484, Silverton 0127, South Africa
University of KwaZulu-Natal, School of Agricultural, Earth and Environmental Sciences, Private Bag X01, Scottsville 3209, Pietermaritzburg, South Africa
Soil and Water Science Department, University of Florida, 2181 McCarty Hall, PO Box 110290, Gainesville, FL 32611, USA
d
Natural Resources and Environmental Management Department, University of Hawai'i Mānoa, 1910 East–West Rd, Sherman 101, Honolulu, HI 96822, USA
b
c
a r t i c l e
i n f o
Article history:
Received 4 March 2014
Received in revised form 15 October 2014
Accepted 26 October 2014
Available online 9 November 2014
Keywords:
Partial least squares regression
Random forest regression
Total carbon
Soil organic carbon
Hydrolysable carbon
Recalcitrant carbon
Chemometric modelling
a b s t r a c t
Accurate assessment of soil carbon fractions would provide valuable contributions towards monitoring in ecological observatories, assessment of disturbance impacts, global climate and land use change. The majority of chemometric modelling studies have focused on measuring only total soil carbon (C), with only a few evaluating
individual soil C pools. Analysis of pools allows for a more detailed picture of ecosystem processes, specifically
decomposition and accretion of C in soils. This study evaluated the potential of the visible near infrared
(VNIR), mid infrared (MIR) and a combined VNIR–MIR spectral region to estimate and predict soil C fractions.
Partial least squares regression (PLSR) and random forest (RF) ensemble tree regression models were used to estimate four different soil C fractions. The soil C fractions analysed included total — (TC), organic — (SOC), recalcitrant — (RC) and hydrolysable carbon (HC). The sample set contained 1014 soil samples collected across the
state of Florida, USA. Laboratory analysis revealed the wide range of total and organic C values, from 1 to
523 g·kg−1, with only about 10% of the samples containing inorganic C which was therefore omitted from the
study. Both PLSR and RF modelling were shown to be effective in modelling all soil C fractions, with as much
as 94–96% of the variation in the TC, SOC and RC pools, and 86% of HC being explained by the models. Although
both PLSR and RF models were successful in modelling C fractions, RF models appear to target the physical properties linked to the property being analysed, and may therefore be the better modelling method to use when
generalising to new areas. This study demonstrates that diffuse reflectance spectroscopy is an effective method
for non-destructive analysis of soil C fractions, and through the use of RF modelling a spectral range between
2000 and 6000 nm should suffice to model these soil C fractions.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Soil carbon (C) sequestration has been targeted by the Intergovernmental Panel on Climate Change (IPCC) as one of the main means to
mitigate rise in atmospheric C levels (Bellon-Maurel and McBratney,
2011; Metz et al., 2007). Total soil C can be separated into different
soil C fractions based on their turnover rates. Whether the soil is a
source or a sink of C depends on the balance between the C recalcitrance
Abbreviations: DRS, diffuse reflectance spectroscopy; HC, hydrolysable carbon; MIR,
mid infrared; PLSR, partial least squares regression; RC, recalcitrant carbon; RF, random forest regression; RMSE, root mean square error; RPD, residual prediction deviation; SOC, soil
organic carbon; SWIR, short wave infrared; TC, total carbon; VNIR, visible near infrared.
⁎ Corresponding author.
E-mail address: sabgru@ufl.edu (S. Grunwald).
1
Present Address: Scitor Corporation, 385 Moffett Park Dr, Sunnyvale, CA 94089, USA.
2
Present Address: Principia College, Biology and Natural Resources Department, Elsah,
IL 62028, USA.
3
Present Address: Division of Plant Sciences, University of Missouri, 214 Waters Hall,
Columbia, MO 65211, USA.
http://dx.doi.org/10.1016/j.geoderma.2014.10.019
0016-7061/© 2014 Elsevier B.V. All rights reserved.
(chemical stability of the soils) and evolution of C cycling processes (i.e.
the impact of decomposition, accretion, etc.). There has been much debate about the impact of climate warming on soil C (Bellamy et al., 2005;
Craine and Gelderman, 2011; Davidson and Janssens, 2006; Fierer et al.,
2005; Thornley and Cannell, 2001; Trumbore et al., 1996) and the uncertainty involved in monitoring soil C across large regions is still substantial (Grunwald et al., 2011). There is in addition a need to assess
the impact of land use shifts and disturbances (e.g., wildfire) on soils;
however a rapid, robust and effective means to monitor changes in
soil C at regional, continental and global scales is lacking. As a response
to these needs research on spectral-based methods has exploded over
the last decade and introduced new avenues including soil C models informed by proximal sensors, such as visible near infrared (VNIR) and
mid infrared (MIR) diffuse reflectance spectroscopy (DRS). Importantly,
these methods have been shown to yield comparable results to traditional time-consuming and more costly laboratory methods (BellonMaurel and McBratney, 2011).
Diffuse reflectance spectroscopy in the VNIR spectral range has been
used widely to characterize soil organic carbon (SOC) and soil total
230
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
carbon (TC) (Brown, 2007; Ge et al., 2011; Sarkhot et al., 2011;
Shepherd and Walsh, 2002; Udelhoven et al., 2003; Vasques et al.,
2008, 2009, 2010). Soil OC models derived from VNIR spectra differed
widely with studies covering 8.1–19.8 g kg− 1 (Viscarra Rossel et al.,
2006), 6.1–33.0 g kg−1 (Reeves et al., 2002), 2.3–55.8 g kg− 1
(Shepherd and Walsh, 2002), 1.3–285.8 g kg− 1 (Chang et al., 2001),
and 0.1–536.8 g kg−1 (Brown et al., 2006) soil C suggesting a more universal applicability of models the wider their C content range. Midinfrared spectroscopy has been demonstrated to predict SOC and TC,
often with better accuracy than VNIR derived models. For example,
McDowell et al. (2012a) predicted TC from VNIR and MIR spectral
data on Hawaiian soils that are highly diverse and complex in terms of
their mineralogy and soil texture. In the study by McDowell et al.
(2012b) it was demonstrated that TC predictions derived from VNIR
and from MIR spectra yielded robust results with R2 values of 0.94 or
greater, root mean square prediction error (RMSE) values ranged from
2.28%–3.08%, and residual prediction deviation (RPD) of 3.38 or more.
In their study the models derived from VNIR and MIR spectra were
contrasted, with both spectral regions not being combined to form single spectra. Other studies, such as by McCarty et al. (2002), Reeves et al.
(2006), and Reeves (2010) have demonstrated the success of MIR to
predict soil C.
The majority of these studies have focused on measuring only total
soil C, with only a few evaluating individual soil C pools. Yet these
pools allow a more detailed picture of ecosystem processes, specifically
decomposition and accretion of C in soils. Thus, pools (i.e., chemically
extracted soil C fractions) have unique capabilities to reflect the evolution of soil C as a result of various imposed stressors (e.g., climate and
land use change). According to Belay-Tedla et al. (2009) C and nitrogen
(N) fluxes are largely controlled by the small but highly bio-reactive, labile pools of these elements in terrestrial soils, while long-term C and N
storage is determined by the long-lived recalcitrant fractions. In their
study they found that recalcitrant and total C and N contents were not
significantly affected by warming, however, labile and microbial biomass C fractions showed significant interactions with warming.
Labile C, although often a small fraction of SOC, significantly affects
heterotrophic respiration at short timescales because of its rapid decomposition (Gu et al., 2004). In a study by Bradford et al. (2007) it
was shown that there are significant interactive effects when the formation and decomposition responses of different soil C fractions are considered. Hence, the total amount of soil C stocks was dependent on
the rate of labile soil C, N and phosphorus (P) inputs to soils. Labile
soil C has also been shown to respond to effects in management
(Biederbeck et al., 1994). Proximal sensing technology has the potential
to infer not only on total soil C content, but also on soil C fractions. However, this research is still in its infancy. For example, Vasques et al.
(2009) used VNIR spectra to estimate various soil C fractions and mineralizable C in a subtropical region dominated by sand-rich soils. Also
Sarkhot et al. (2011) estimated SOC, inorganic C, and labile C in a
small floodplain area on clay-rich soils in Texas. More work needs to
still be done in the field of chemometric modelling of C fractions
encompassing diverse soil types to determine whether spectral modelling would be a robust method for predicting of soil C fractions.
Recently, Bellon-Maurel and McBratney (2011) critically reviewed
both NIR and MIR spectroscopic techniques for soil C studies. Their review highlighted that MIR spectroscopy yielded better (10–40% improvement) models than models developed from NIR spectra. Of the
studies reviewed none of them compared NIR and MIR spectroscopy
on the same soil samples indicating a critical research gap which is addressed in this study. From the studies they reviewed, they applied
either multiple linear regression (MLR) or partial least squares regression (PLSR) for their chemometric models. The simpler MLR models
were shown to perform comparably to the PLSR models; however,
both methods showed shortcomings in terms of wavelengths selected.
Judicious selection of wavelengths is imperative when applying a
model to new areas where the soil C range might be similar, but the
soil characteristics differ from those where the model was built. Wavelength selection is likely to be an important contributor to increasing the
bias in results when the validation dataset is acquired from an “out of
field” source. Both methods were found to be biased when applied to
new soil samples with different compositions. Recently, the use of
non-linear modelling methods, such as ensemble tree regression
modelling, has been shown to improve model performances in terms
of output error metrics (Vasques et al., 2009, 2010; Viscarra Rossel
and Behrens, 2010; Vohland et al., 2011).
The focus of this study was to determine the potential of both the
VNIR and MIR spectral regions to estimate soil C fractions. Partial least
squares regression (PLSR) and random forest (RF) regression models
were derived to estimate soil C fractions. These methods were evaluated
firstly to determine which method could best predict soil C fractions,
and secondly it was determined if either method produced output
that is attributed to the physics of soil C properties. We assume that if
the most important model factors (i.e., the importance factors for RF
and loadings for PLSR) can be attributed to physical soil C properties
the models would be robust if generalised and applied to other areas
in the future. The soil C fractions focussed on in this study include
total C (TC), soil organic C (SOC), recalcitrant C (RC), and hydrolysable
C (HC). VNIR and MIR chemometric models were derived from a wide
selection of soil samples collected over the state of Florida, USA. The
study objectives were to determine the predictive capabilities of the
VNIR, MIR and combined VNIR–MIR spectral datasets for predicting
soil C fractions (TC, SOC, RC, and HC). Within this overall objective we
evaluated the following sub-objectives:
1. Determine which spectral region performs best.
2. Evaluate which modelling method (PLSR or RF) performs best based
on validation datasets.
3. Identify if there is a best method in terms of data preparation for C
fraction chemometric analysis.
2. Methods
2.1. Field sampling
A total of 1014 sample sites were located over the state of Florida
using a stratified random sampling design. The strata were defined
based upon soil suborder data obtained from the Soil Data Mart — Soil
Survey Geographic Database (SSURGO) (Natural Resources Conservation Service — http://soildatamart.nrcs.usda.gov/), and a reclassified
land use/cover map derived from the 2003 Florida Vegetation and
Land Cover Data map prepared by the Florida Fish and Wildlife Conservation Commission, Tallahassee, Florida. Dominant soils in the state are
Spodosols (32%), Entisols (22%), Ultisols (19%), Alfisols (13%), Histosols
(11%), and Mollisols and Inceptisols together occupy b3% of the state
(Vasques et al., 2010).
The soil samples were collected over a period of 1.25 years between
2008 and 2009. Sample sites were located using a differential global positioning system. At each site four 20 × 5.8 cm surface soil cores were
collected from within a 2 m diameter area. These four cores were bulked
and cooled in the field and then transported and processed in the lab. In
addition to the sampled soils, data was collected of the soil suborder and
land use practice found at each site.
2.2. Laboratory methods
The bulk samples were air-dried and sieved to retrieve the fine earth
fraction (b 2 mm), thoroughly mixed and then stored. Subsamples were
taken from these dried, sieved and mixed bulk samples for the laboratory analysis. A portion of these subsamples were ball milled for use in
some of the laboratory procedures outlined below.
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
2.2.1. Chemical analysis
Chemical analysis for deriving the respective C fractions was performed on all the soil samples and included a 5% replication. The C fractions were measured using a Shimadzu TOC-VCPN catalytic combustion
oxidation instrument with a SSM-5000a solid sample module
(Shimadzu Scientific Instruments, Kyoto, Japan), following their different pre-processing. Total C (TC) was measured on 80–700 mg ball
milled samples combusted at 900 °C. Inorganic C (IC) was derived by
CO2 evolution from a reaction of 20–250 mg ball milled samples with
42.5% phosphoric acid at 200 °C. Soil organic C (SOC) was calculated
by subtraction of IC from the TC. Hot water extractable ‘labile’ C (hydrolysable carbon — HC) was measured by incubating 4 g of soil in 40 mL
(1:10) of double de-ionized water for 16 h at 80 °C. Samples were
then filtered to 0.22 μm. The non-hydrolysable ‘recalcitrant’ C (RC)
was measured by digesting 2 g of the ball milled soil in 10 mL of 5 M
HCL under reflux conditions for 16 h. The soil digest was washed 3
times by centrifuge, dried and the remaining undigested C was then
combusted at 900 °C.
A Spearman's correlation was run to compare the relationships between the different C fractions. This was used as a preliminary comparison to determine the potential similarity in the derived chemometric
models.
2.2.2. Spectral data acquisition and pre-model processing
2.2.2.1. VNIR. Soil samples were scanned with an Analytical Spectral Devices Lab Spec 5000 (Analytical Spectral Devices, Boulder, CO). The spectral range of the instrument covers 350 to 2500 nm, with a spectral
resolution of 3 nm in the VNIR detector, and 10 nm in the two
short wave infrared (SWIR) detector ranges. The sampling range is
1.377 nm in the VNIR range and 2 nm in the SWIR range, producing a
spectrum length of 2151 points. The soil samples (b2 mm) were volumetrically controlled (5.5 cm3) in 46 mm petri dishes and oven dried
at moderate heat (about 45 °C) for at least 12 h before scanning. The
samples were cooled for an hour prior to taking the spectral measurements. Samples were scanned from above at four positions (90° rotations) using a custom designed rotating sample stage. A Spectralon
(Labsphere, North Sutton, NH) white reference panel measurement
was taken, followed by 25 scans in one of the four directions, the sample
was then rotated, and the same procedure followed. After all four directions had been sampled internally the 100 measured spectra were then
averaged to produce a single spectrum per sample. The calibrated
Spectralon reference panel allows for the measured radiance measurements to be converted into relative reflectance values.
2.2.2.2. MIR. Mid-infrared DRS spectra were collected from ball milled
subsamples using a Scimitar 2000 FTIR spectrometer (Varian, Inc.,
now Agilent Technologies, Santa Clara, CA) outfitted with a diffuse reflectance infrared Fourier transform (DRIFT) accessory. The instrument
measures reflectance over the range of 400–6000 cm− 1 (1666–
25,000 nm) with a sampling interval of 2 cm−1 and spectral resolution
of 4 cm−1. Each spectrum was the averaged result of 32 co-added scans.
KBr powder in the first of the eight sample holders provided a background spectrum that was subtracted from each scan to correct for atmosphere and instrument effects. A new background spectrum was
collected at the beginning of a session and between every seven samples. The KBr correction highlighted two narrow regions of the spectrum that were affected by fluctuating atmospheric conditions and
exhibited residual atmospheric features. These regions fell between
1350–1419 cm−1 and 2281–2449 cm−1 and were therefore excluded
from further analysis and interpretation.
2.2.2.3. Outlier detection. Outlier detection was performed on both the
VNIR and MIR spectra separately prior to modelling. The assumption
made in the outlier detection was that a spectrum should resemble another spectrum taken within the same soil suborder. Spectra collected
231
within the VNIR spectral range were grouped into their respective soil
suborders. For each soil suborder a mean spectrum ±2 standard deviations from this mean was calculated. Each individual spectrum of the respective soil suborder was then compared against the spectral range
defined for the suborder. If greater than 75% of this spectrum fell outside
the entire spectral range it was considered to be an outlier and was excluded from further modelling. The same outlier detection process was
applied to the MIR spectra.
2.2.2.4. VNIR & MIR fusion. Prior to data fusion the MIR spectral range
measured in wave numbers (cm− 1), was converted into nanometre
(nm) wavelength units (nm = 10,000,000/wave number). The first
stage of fusing the spectral data involved matching the respective sample sites that contained spectra from both spectral regions, bearing in
mind that some samples were removed during outlier detection. The
VNIR and MIR spectral ranges overlap between 1666–2500 nm (SWIR
region), providing the opportunity to compare the spectral behaviour
between these two instruments. Across this overlapping spectral
range it was found that the MIR spectra were on average between 10
and 30% brighter (i.e., higher reflectance value) than those recorded in
the VNIR range. This difference therefore had to be accounted for
when fusing the two spectra. Typically reflectance values of soil spectra
fall below 10% in the MIR spectral region between 5000 and 7500 nm
(2000–1330 cm−1), it was decided to adjust the VNIR range upwards,
thus preventing any region of the fused spectrum falling below a reflectance value of 10%. Per paired spectrum (i.e., VNIR + MIR spectrum for a
sample site) the mean difference across the overlapping region was calculated, and this mean difference value was then added to the VNIR
spectrum, thus adjusting these VNIR spectra upwards. Spectra acquired
in the SWIR region from the VNIR-DRS instrument had a higher signal to
noise ratio compared to the spectrum acquired from the MIR-DRS instrument in this region of the spectrum, as a result of this region falling
into the edge of the detector limits. In fusing the adjusted VNIR and the
MIR data, it was decided to use the entire VNIR-DRS spectrum, and fuse
it to the remaining MIR spectrum from 2500 to 25,000 nm. The end results of this spectral fusion process were integrated spectra covering the
spectral range from 350 to 25,000 nm (Fig. 1). To reduce the modelling
processing time applied to these fused spectra the spectra were resampled to 5 nm increments across the spectral range.
2.3. Carbon fraction modelling
The three spectral datasets were used to create regression models
for the prediction of soil C fractions. In order to test the predictive capabilities of the developed models the datasets were split into a calibration/validation (70/30%) set. The data split was based upon a stratified
random selection of samples within each of the field sampling strata.
Samples were assigned to the calibration/validation split prior to the
outlier detection and VNIR & MIR data fusion. Following the removal
of samples from the outlier detection and VNIR & MIR data fusion the
calibration dataset contained 696 samples and the validation set
contained 296 samples. All C fraction values were log transformed
prior to modelling. The presented results are based on the respective
log transformed C values.
Two different chemometric modelling methods were applied in the
spectral analysis to test their abilities in modelling and predicting different soil C fractions. Partial least squares regression, a linear modelling
technique frequently applied in chemometric analysis (Geladi and
Kowalski, 1986; Mevik and Wehrens, 2007; Wehrens, 2011), and RF ensemble tree regression (Breiman, 2001; Seni and Elder, 2010; Wehrens,
2011) a newer non-linear modelling technique, were both applied.
Background noise is inherent within spectral data. The effects of
noise filtering and preprocessing methods was tested following the application of a variety of different background scatter removal processing
techniques, prior to modelling. Nine common scatter correction
methods were applied (Wehrens, 2011) and compared to reflectance
232
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Fig. 1. Mean combined VNIR–MIR spectrum for each of the soil orders collected in this study.
spectra that had only been averaged with no further processing applied
(RAW). The methods included: multiplicative scatter correction (MSC),
MSC followed by a 5 window smoothing Savitzky–Golay filter applied to
1st derivative spectra (MSC-1st D), standard normal variate (SNV) correction, 5 window smoothing Savitzky–Golay filter (SG), 5 window
quadratic smoothing Savitzky–Golay filter (SG-Quad), 1st derivative-5
window smoothing Savitzky–Golay filter (SG-1st D), 1st derivative-5
window quadratic smoothing Savitzky–Golay filter (SG-1st D-Quad),
absorbance (log 10(1/x), where x is the reflectance value), and transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter (log 10 (1/x) SG-1st D). Prior to
regression model implementation the spectral data were mean centred.
The regression modelling was implemented in R (R Development
Core Team, 2010), using the packages “pls” (Mevik and Wehrens,
2007) for the PLSR modelling and the package “randomForest” (Liaw
and Wiener, 2002) for the RF modelling. For the PLSR modelling, a maximum limit on latent variables to be used for the models was set at 20.
For each developed model the number of latent variables (NLV) selected
was based upon the selection of the first minima defined in the prediction residual error sum of squares (PRESS) or if no local minima occurred then the LV which produced the flattest slope between each
PRESS value was selected (Mevik and Cederkvist, 2004). In addition to
validation using the external validation dataset, an internal k-fold
cross validation was performed using a random split into 15 segments.
For RF modelling, following a pre-test (data not shown) it was verified
that with 400 splits for the maximum of 500 trees in the forest produced
comparable results to the default split value calculated as number of parameters/3 (Seni and Elder, 2010). Reduction in the number of splits resulted in a substantial reduction in processing time of the RF models.
The developed models were then applied to the independent validation dataset and the predictive powers of the models were assessed in
terms of the produced R2, root mean square error (RMSE), and residual
prediction deviation (RPD) values (Williams, 1987). The RPD is defined
as RPD = SD/SEP, where SD is the standard deviation of the dataset and
SEP is the standard error of prediction. For each C fraction, spectral
range and modelling method the model that reduced the error metrics
the most was selected. The variable loadings/importance values were
then evaluated and are discussed with respect to whether the selected
model outcomes are based upon physical properties related to C and
what effect this might have in implementing these models to new
datasets from different regions.
3. Results and discussion
3.1. Laboratory analysis
The summary statistics of the C fractions for the samples used in the
spectral modelling are presented in Table 1. In terms of C variability, excessively well drained quartz-dominated, marine sands have low biomass production rates relative to mineralization rates and hence are
low in C. Marshes and wetland areas, on the other hand, are high in C
due to high productivity and anaerobic conditions (Vasques et al.,
2012). The soils in this region cover a wide range of the attribute feature
space with TC values ranging from about 1 to 523 g·kg−1 making this a
unique testing ground for this study (Table 1).
In comparison with previous spectral studies conducted in different
areas the range of total and organic C is very broad (Janik et al., 1998;
McCarty et al., 2002; Vasques et al., 2008). The samples taken across
Florida contain very low inorganic C content, resulting in similar content
levels in the total and organic C fractions (Table 1). As a result of only a
few of the samples containing IC, this C pool was omitted from the
study. The analysis shows that the bulk of the C within Florida is concentrated within the stable recalcitrant pool, as measured by RC, compared
to the low concentrations of C found in the labile C pool, as measured by
HC. Nevertheless, the labile carbon pool is critical because it represents
the dynamic component of the C cycle and measurements of HC can indicate effects due to land use, climate, moisture or other landscape
changes more rapidly than measurements of TC or SOC (Bradford
et al., 2007; Hu et al., 1997).
The correlation analysis (Table 2) between the different C fractions
shows that in Florida TC, SOC and RC levels in the soil are highly
Table 1
Summary statistics for laboratory analysis derived carbon fractions divided into the calibration and validation sets. All fractions are presented in g kg−1.
Carbon fraction
Total C (TC)
Organic C (SOC)
Recalcitrant C
(RC)
Hydrolysable C
(HC)
Calibration (n = 696)
Validation (n = 296)
Min
Max
Mean
Median
Min
Max
Mean
Median
1.33
1.33
0.67
523.27
523.27
502.07
32.00
31.54
21.13
11.65
11.48
6.18
3.43
3.43
1.93
430.64
430.64
375.70
37.94
37.21
26.87
11.77
11.56
5.83
0.05
19.24
0.88
0.53
0.11
9.30
0.91
0.52
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Table 2
Spearman's correlation analysis of the paired soil carbon fraction soil samples acquired
over the state of Florida.
Carbon fraction
TC
SOC
RC
Total C (TC)
Organic C (SOC)
Recalcitrant C (RC)
Hydrolysable C (HC)
HC
0.99
0.96
0.87
0.97
0.87
0.84
correlated, and HC is less related to these three other fractions. Based on
this we expect that the chemometric models derived from TC, SOC and
RC will be similar and that there will be considerable differences in the
models derived for HC.
Pre-processing had a greater impact on the modelling approach applied rather than the spectral region being considered. For PLSR modelling the RAW spectral data produced validation error metrics that were
only slightly poorer than the best performing pre-processing method.
This was irrespective of the carbon fraction or spectral region being considered (Table 3). In models developed using RF regression spectral preprocessing improved the predictive capability of models by an order of
Table 3
Partial least square validation error metrics produced using the different spectral pre-processing methods applied to the spectral data from the VNIR, MIR and combined VNIR–MIR
spectral regions of four log transformed soil carbon fractions (log g·kg−1). The highlighted
cells are the methods that minimised the error metrics and refer to the models that were
selected for the model comparisons that are discussed and highlighted in Figs. 1, 3 and 5,
and Table 4.
TC
SOC
RC
HC
Pre-processinga
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
2–8% (Table 4). For RF modelling it appears that a minimum application
of smoothing on 1st derivative spectra will improve model performance
(Table 4). From this comparative study no single “best” pre-processing
method can be suggested for C fraction modelling using spectral processing. It is shown however, that MSC and SNV in general perform
poorly (Tables 3 and 4), this contrasts from other studies that have
shown these two pre-processing methods to improve their models
(Cambule et al., 2012; McDowell et al., 2012a; Vasques et al., 2008).
Given the similarity in concentrations of the TC and SOC fractions
(only about 10% of the samples were different due to measurable IC) it
is not surprising to see that the validation outcomes with respect to
pre-processing are very similar for both the PLSR and RF modelling procedures (Tables 3 and 4).
3.3. Carbon fraction model evaluations
3.2. Pre-processing
C Fractions
233
R2
0.86
0.76
0.79
0.78
0.86
0.86
0.86
0.86
0.81
0.61
0.85
0.75
0.81
0.78
0.85
0.85
0.85
0.87
0.82
0.58
0.82
0.72
0.75
0.77
0.82
0.81
0.82
0.83
0.85
0.65
0.81
0.73
0.72
0.74
0.81
0.81
0.8
0.8
0.82
0.68
VNIR
RMSE
0.41
0.53
0.50
0.51
0.41
0.41
0.41
0.41
0.47
0.68
0.42
0.55
0.48
0.52
0.42
0.43
0.42
0.40
0.46
0.70
0.52
0.65
0.61
0.59
0.52
0.53
0.52
0.51
0.48
0.73
0.36
0.42
0.43
0.41
0.36
0.36
0.36
0.36
0.35
0.46
RPD
2.6
2.0
2.2
2.1
2.6
2.6
2.6
2.6
2.3
1.6
2.6
2.0
2.3
2.1
2.6
2.5
2.6
2.7
2.3
1.5
2.3
1.8
2.0
2.1
2.3
2.3
2.3
2.4
2.5
1.7
2.3
1.9
1.9
1.9
2.3
2.3
2.2
2.2
2.3
1.7
R2
0.95
0.91
0.87
0.93
0.95
0.95
0.94
0.94
0.95
0.94
0.95
0.91
0.87
0.93
0.95
0.95
0.94
0.93
0.96
0.94
0.94
0.91
0.86
0.93
0.94
0.94
0.92
0.92
0.94
0.92
0.86
0.84
0.74
0.84
0.86
0.86
0.79
0.78
0.86
0.80
MIR
RMSE
0.24
0.33
0.41
0.29
0.24
0.24
0.27
0.27
0.24
0.27
0.24
0.33
0.41
0.28
0.24
0.24
0.27
0.28
0.23
0.27
0.30
0.36
0.45
0.32
0.31
0.30
0.35
0.35
0.30
0.33
0.31
0.33
0.42
0.33
0.31
0.31
0.38
0.38
0.30
0.37
RPD
4.5
3.3
2.6
3.8
4.5
4.5
4.0
4.0
4.6
4.0
4.5
3.3
2.6
3.8
4.5
4.5
4.0
3.9
4.7
4.0
4.0
3.4
2.6
3.7
3.9
4.0
3.4
3.4
4.0
3.6
2.6
2.4
1.9
2.4
2.6
2.6
2.1
2.1
2.6
2.2
R2
0.95
0.90
0.89
0.92
0.95
0.95
0.95
0.95
0.95
0.95
0.95
0.91
0.88
0.92
0.95
0.95
0.95
0.95
0.95
0.95
0.93
0.90
0.89
0.91
0.93
0.93
0.93
0.93
0.93
0.92
0.86
0.83
0.80
0.84
0.86
0.86
0.86
0.86
0.86
0.81
VNIR-MIR
RMSE
0.24
0.35
0.37
0.31
0.24
0.23
0.24
0.24
0.25
0.25
0.24
0.33
0.38
0.31
0.23
0.23
0.23
0.23
0.23
0.24
0.31
0.38
0.39
0.36
0.31
0.32
0.31
0.31
0.31
0.33
0.30
0.33
0.36
0.33
0.30
0.30
0.30
0.30
0.30
0.35
RPD
4.5
3.0
2.9
3.5
4.5
4.6
4.5
4.5
4.3
4.3
4.5
3.2
2.8
3.4
4.6
4.6
4.6
4.6
4.7
4.4
3.9
3.1
3.0
3.3
3.9
3.7
3.8
3.9
3.8
3.6
2.7
2.4
2.2
2.4
2.7
2.7
2.7
2.7
2.6
2.3
a
The different pre-processing techniques applied: RAW = averaged; MSC = multiplicative scatter correction; MSC-1st D = MSC followed by a 5 window smoothing
Savitzky–Golay filter applied to 1st derivative spectra; SNV = standard normal variate
correction, SG = 5 window smoothing Savitzky–Golay filter; SG-Quad = 5 window quadratic smoothing Savitzky–Golay filter; SG-1st D = 1st derivative-5 window smoothing
Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing
Savitzky–Golay filter; log 10(1/x) = absorbance (where x is the reflectance value); log
10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter.
We have shown that the various C fractions measured in this study
can be predicted with high accuracy through the application of chemometric models. The best prediction models were capable of explaining
up to 86% of the variation in the labile (HC) pool and as much as 94–
96% of the variation in the TC, SOC and RC pools (Tables 5 and 6).
For the dataset studied we found that when comparing PLSR and RF
as techniques for modelling C fractions, both methods produce comparable error metrics in all the spectral ranges studied, which is similar to
the findings of McDowell et al. (2012a). When evaluating the error metrics in terms of the spectral ranges in their ability to model the separate
C fractions we find that the MIR and VNIR–MIR spectral ranges produced C fraction models of higher predictive capabilities than models
produced using only the VNIR range. This is in agreement with the findings of Reeves (2010), though the differences between the two spectral
ranges were closer than he found. Combining the VNIR and MIR spectral
range however, does not dramatically improve models when compared
to using only the MIR spectral range (Tables 3/5 and 4/6).
The cross-validation error metrics shown in both the PLSR and RF selected models (Tables 5 and 6) are lower than the validation (and calibration metrics for the PLSR models) metrics. Inherent in the RF
modelling process is the application of an “Out of Bag” (OOB) crossvalidation technique similar to the internal k-fold cross validation that
was performed in the PLSR modelling procedure. This internal crossvalidation method has been documented to produce on average lower
error metrics for the test models (Wehrens, 2011), but produces robust
output models, which are shown by the comparable results of the calibration and validation sets in the RF models. These comparable models
indicate that they are not overfitted. This is further verified by the similarity in the validation error metrics produced by both the PLSR and RF
models.
The error metrics produced by the RF and PLSR modelling show that
both methods produced models with similar predictive capabilities.
Analysis of the spectral loadings or importance values generated during
the RF and PLSR modelling showed that these methods highlight and accentuate different spectral properties and wavelengths in each of the
spectral regions. Higher loading and the importance values, indicate
variables that contribute more weight to the prediction models. A consistent finding, irrespective of the spectral range considered, is that the
RF models have only a few specific wavelengths across the respective
spectral ranges that are of high importance to the models. The important wavelengths identified for the different C fractions tend to coincide
in the RF models; however, the relative intensity of these wavelengths
differs. The PLSR wavelength loadings, in contrast, tend to be broader,
and for all the spectral regions TC and SOC models are similar. In contrast, HC and RC show different wavelength loading properties; some
were similar to the TC and SOC, and others were substantially different.
Below we highlight the model spectral properties in each of the
analysed spectral regions with respect to the modelling method applied
and the C fraction modelled.
234
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Table 4
Random forest validation error metrics produced using the different spectral pre-processing methods applied to the spectral data from the VNIR, MIR and combined VNIR_MIR
spectral regions of four log transformed soil carbon fractions (log g·kg−1). The highlighted
cells are the methods that minimised the error metrics and refer to the models that were
selected for the model comparisons that are discussed and highlighted in Figs. 2, 4 and 6,
and Table 5.
C Fractions
TC
SOC
RC
HC
Pre-processinga
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG–Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
RAW
MSC
MSC-1st D
SNV
SG
SG –Quad
SG–1st D
SG–1st D–Quad
Log 10 (1/x)
Log 10 (1/x), SG–1st D
R2
0.85
0.60
0.80
0.63
0.85
0.85
0.88
0.88
0.85
0.80
0.86
0.60
0.80
0.63
0.86
0.86
0.88
0.88
0.86
0.80
0.83
0.58
0.78
0.60
0.83
0.83
0.87
0.87
0.82
0.77
0.75
0.57
0.73
0.58
0.74
0.74
0.80
0.80
0.74
0.74
VNIR
RMSE
0.42
0.70
0.51
0.69
0.42
0.42
0.39
0.39
0.42
0.49
0.41
0.70
0.51
0.68
0.41
0.41
0.38
0.38
0.41
0.49
0.51
0.80
0.60
0.79
0.51
0.51
0.46
0.46
0.52
0.58
0.41
0.53
0.42
0.52
0.41
0.41
0.36
0.36
0.41
0.41
RPD
2.6
1.5
2.1
1.6
2.6
2.6
2.8
2.8
2.6
2.2
2.7
1.5
2.1
1.6
2.6
2.6
2.8
2.8
2.6
2.2
2.4
1.5
2.0
1.5
2.4
2.4
2.6
2.6
2.3
2.1
2.0
1.5
1.9
1.5
2.0
2.0
2.2
2.2
2.0
2.0
R2
0.92
0.93
0.92
0.93
0.92
0.92
0.94
0.94
0.92
0.93
0.92
0.93
0.92
0.93
0.92
0.92
0.94
0.94
0.92
0.93
0.91
0.92
0.92
0.92
0.91
0.91
0.93
0.93
0.91
0.93
0.77
0.80
0.79
0.79
0.77
0.77
0.81
0.81
0.78
0.81
MIR
RMSE
0.31
0.28
0.31
0.28
0.30
0.31
0.28
0.28
0.31
0.29
0.30
0.29
0.31
0.28
0.30
0.30
0.28
0.28
0.30
0.29
0.37
0.35
0.35
0.35
0.36
0.37
0.34
0.34
0.38
0.33
0.39
0.36
0.37
0.37
0.38
0.39
0.35
0.35
0.38
0.36
RPD
3.5
3.8
3.5
3.8
3.6
3.5
3.9
3.9
3.5
3.8
3.6
3.8
3.5
3.8
3.5
3.6
3.9
3.9
3.6
3.7
3.3
3.5
3.4
3.5
3.3
3.2
3.6
3.6
3.2
3.6
2.1
2.2
2.2
2.2
2.1
2.1
2.3
2.3
2.1
2.3
R2
0.92
0.92
0.94
0.93
0.91
0.92
0.95
0.95
0.92
0.95
0.92
0.92
0.94
0.92
0.92
0.92
0.94
0.94
0.92
0.95
0.91
0.91
0.93
0.91
0.91
0.91
0.93
0.93
0.91
0.93
0.77
0.80
0.85
0.80
0.77
0.77
0.85
0.85
0.77
0.85
VNIR-MIR
RMSE RPD
0.31
3.5
0.3
3.6
0.26
4.2
0.29
3.6
0.31
3.4
0.31
3.5
0.25
4.2
0.25
4.2
0.31
3.5
0.25
4.3
0.30
3.5
0.30
3.6
0.27
4.0
0.29
3.6
0.31
3.5
0.30
3.5
0.26
4.1
0.26
4.1
0.30
3.5
0.25
4.2
0.37
3.3
0.35
3.4
0.33
3.6
0.36
3.3
0.36
3.3
0.36
3.3
0.32
3.7
0.32
3.8
0.37
3.2
0.32
3.7
0.38
2.1
0.36
2.2
0.32
2.5
0.36
2.2
0.39
2.1
0.39
2.1
0.31
2.6
0.31
2.6
0.39
2.1
0.31
2.6
a
The different pre-processing techniques applied: RAW = averaged; MSC = multiplicative scatter correction; MSC-1st D = MSC followed by a 5 window smoothing
Savitzky-Golay filter applied to 1st derivative spectra; SNV = standard normal variate
correction, SG = 5 window smoothing Savitzky–Golay filter; SG-Quad = 5 window quadratic smoothing Savitzky–Golay filter; SG-1st D = 1st derivative-5 window smoothing
Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing
Savitzky–Golay filter; log 10(1/x) = absorbance (where x is the reflectance value); log
10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter.
3.3.1. VNIR
The behaviour of the TC and SOC wavelength loadings is distinctly
different from that of RC and HC in the PLSR-VNIR models (Fig. 2). The
RC and HC pools show broad wavelength loadings that coincide with
the border between the O\H and C\H combination stretches, and a
peak associated with the water absorption feature between 1850 and
Table 5
Calibration, cross-validation and validation error metrics for the selected partial
least square models (the models highlighted in Table 2) derived from the three
spectral datasets modelled to derive and predict soil log transformed carbon fractions (log g·kg− 1) across Florida.
Spectrum
VNIR
MIR
VNIR–
MIR
C
fraction
Calibration
R2
RMSE RPD R2
Cross-validation
RMSE RPD R2
RMSE RPD
TC
SOC
RC
HC
TC
SOC
RC
HC
TC
SOC
RC
HC
0.85
0.85
0.80
0.81
0.96
0.95
0.91
0.84
0.96
0.96
0.91
0.88
0.40
0.39
0.49
0.36
0.21
0.23
0.34
0.33
0.20
0.21
0.33
0.29
0.44
0.45
0.52
0.39
0.25
0.26
0.36
0.36
0.23
0.24
0.36
0.32
0.41
0.40
0.48
0.35
0.24
0.23
0.30
0.30
0.23
0.23
0.31
0.30
2.6
2.6
2.3
2.3
4.3
4.4
3.3
2.5
5.0
4.8
3.4
2.9
0.81
0.80
0.78
0.78
0.94
0.94
0.90
0.81
0.95
0.94
0.90
0.85
2.3
2.2
2.1
2.1
4.0
4.0
3.1
2.3
4.4
4.2
3.1
2.5
Validation
0.86
0.87
0.85
0.82
0.95
0.96
0.94
0.86
0.95
0.95
0.93
0.86
2.6
2.7
2.5
2.3
4.6
4.7
4.0
2.6
4.6
4.7
3.9
2.7
1920 nm. In contrast, TC and SOC have a number of narrower peaks
that coincide with known absorption features. A prominent peak at
1000 nm although falling within the broad N\H/O\H 2nd overtone
stretch feature has also been associated with a small narrow Si\O
stretch feature (Manchanda et al., 2002). Similar to the RC and HC wavelength loadings, prominent peaks occur at the border between the O\H
and C\H combination stretch and the water feature between 1850 and
1920 nm. A third peak occurs in the overlapping zone between the C\H
combination stretch and the N\H/O\H 1st overtone stretch. The peak
that occurs at 1850 nm is right on the edge of the C\H 1st overtone
stretch. Although no artefacts were observed in the collected spectra,
both 1000 nm and 1850 nm wavelengths coincide with the overlap regions of the spectral detectors in the ASD instrument, and may therefore
be artefacts caused by the spectroradiometer.
The RF C fraction models within the VNIR range contrast significantly to the PLSR models (Fig. 3 vs 2). Unlike the PLS loadings the important
wavelengths identified in the RF models form narrow peaks, rather than
the broader peaks observed in the PLS loading plots. In the RF models,
for TC, SOC and RC, the most prominent peak occurs in the orange region of the visible spectrum (590–620 nm), with some smaller peaks
in the yellow region (570–590 nm). For these three fractions some
smaller peaks are linked to N\H/O\H/C\H combination stretches between 2000 and 2500 nm. Hydrolysable C shares peak locations with
the other three fractions, but unlike the other fractions a peak in the
N\H/O\H combined stretch region has greater importance than the
contribution of visible spectrum wavelengths. McDowell et al. (2012a)
also found similar peaks in their RF models above 2000 nm for TC, however their strongest peaks were in this region rather than in the visible
spectral region as we found in this study.
3.3.2. MIR
The loading plots produced by the PLS-MIR models are noisy across
the entire MIR spectral region (Fig. 4). Unlike the models created in the
VNIR and VNIR–MIR spectral regions the TC and SOC models are different across most of the entire spectral range, however these two models
do share the locations of the principal loading peaks. In the automated
selection of the number of latent variables (NLV) for these four models
18, 15, 15 and 15 NLV were selected for TC, SOC, RC and HC, respectively,
the three additional variables selected in the TC model result in the differences observed between the TC and SOC. Similar to the results seen
with the VNIR spectral range, all C fraction loading peaks occur in the
spectral range associated with O\H bonds, in this instance with the
O\H fundamental bond vibrations. The peak observed at 7500 nm for
all C fractions has not been associated with any soil bond vibrations,
and needs to be investigated further to determine if this is a shift of
C\O stretch feature (although a second smaller feature does occur
within the C\O stretch region), or in fact relates to another property.
The multiple smaller loading peaks across the spectral range between
12,000 and 25,000 nm are of interest in the PLSR models. This region
is associated with Si\O bond vibrational stretches and would tie in
with the high quartz content found in the majority of the Florida soil
samples.
The variable importance plots of the four C fractions produced by RF
modelling again shows completely different behaviour to that seen in
the PLSR modelling. Only a small component of the MIR spectral range
contributes to modelling the four C fractions (Fig. 5). In contrast to the
PLSR models there are no important variables associated with the
Si\O bond features, and the important variables are associated directly
with C bond features. Similar to what was observed in the VNIR spectral
range, HC does not have as many wavelength-variable peaks that contribute to its model. In the MIR range the HC importance features are
dominated by 3 narrow peaks linked to C_O stretching. Though these
are the dominant variables linked to TC, SOC and RC, they also include
features in the C`C absorption region. In the C`C region RC has only
a single peak, compared to the two peaks displayed by TC and SOC.
The locations of the peaks in the region of 5500 nm are similar to
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Table 6
Cross-validation and validation error metrics for the selected Random Forest models (the
models highlighted in Table 3) derived from the three spectral datasets modelled to derive
and predict soil log transformed carbon fractions (log g·kg−1) across Florida. In addition
based on the random forest analysis of VNIR–MIR it was identified that the region
2000–6000 nm could be used to model the different C fractions, the final row provides
the validation findings for the different carbon fractions run on this subset of the combined
spectrum.
Spectrum
C fraction
Cross-validation
R2
VNIR
MIR
VNIR–MIR
2000–6000 nm
TC
SOC
RC
HC
TC
SOC
RC
HC
TC
SOC
RC
HC
TC
SOC
RC
HC
RMSE
0.84
0.41
0.84
0.41
0.79
0.51
0.79
0.38
0.92
0.30
0.91
0.31
0.88
0.39
0.77
0.41
0.92
0.29
0.92
0.29
0.89
0.37
0.84
0.34
Log 10 (1/x) SG-1st Da
Log 10 (1/x) SG-1st Da
SG-1st Da
SG-1st D-Quada
Validation
RPD
R2
RMSE
RPD
2.4
2.4
2.2
2.1
3.4
3.3
2.8
2.0
3.5
3.5
3.0
2.4
0.88
0.88
0.87
0.80
0.94
0.94
0.93
0.81
0.95
0.95
0.93
0.85
0.95
0.94
0.93
0.85
0.39
0.38
0.46
0.36
0.28
0.28
0.33
0.35
0.25
0.25
0.32
0.31
0.25
0.27
0.33
0.31
2.8
2.8
2.6
2.2
3.9
3.9
3.6
2.3
4.3
4.2
3.8
2.6
4.2
4.0
3.6
2.6
a
The pre-processing steps providing the highest calibration and validation output for
the 2000–6000 nm spectral range: SG-1st D = 1st derivative-5 window smoothing
Savitzky–Golay filter; SG-1st D-Quad = 1st derivative-5 window quadratic smoothing
Savitzky–Golay filter; log 10 (1/x) SG-1st D = transformation to absorbance and then application of the 1st derivative-5 window smoothing Savitzky–Golay filter.
235
When the VNIR and MIR ranges are combined for the PLSR modelling (Fig. 6), none of the features that were prominent in the PLSR
models generated for the VNIR region (Fig. 2) are prominent in the combined spectrum. The combined range loadings are however smoother
than those presented in the loading plot produced by using only the
MIR (Fig. 4). Comparison of the MIR (Fig. 4) and VNIR–MIR (Fig. 6) loading plots shows similar loading peaks, with the most prominent peak
occurring at 2700 nm associated with O\H fundamentals, a double feature in C\O stretch region and the broad double loading peaks in the
Si\O stretch regions. A feature between 3950 and 4000 nm has also
been related to O\H stretching, and occurs in both the MIR and
VNIR–MIR loading plots. In the VNIR–MIR region the automated PLSR
NLV selection procedure for the four C fractions resulted in 18, 17, 13
and 18 NLV being selected for the TC, SOC, RC and HC models,
respectively.
Similar to what is seen in the VNIR and MIR RF importance plots
(Figs. 3 and 5), in the VNIR–MIR importance plot only a very small region of the entire spectrum shows variables that were important to
the RF modelling of the C fractions (Fig. 7). The MIR and VNIR–MIR RF
models shared two prominent features located in the C_O stretch
and the C`C stretch region (Figs. 5 and 7). The VNIR–MIR HC model differed from its MIR model by containing peaks in two regions, the one in
the same C_O stretch region and the second located in the NIR at a feature associated with a combined O\H/N\H bond stretch (Figs. 5, 6 and
7). Unlike the MIR models the VNIR–MIR models for TC, SOC and RC in
addition also display an important variable associated with the edge of
the O\H stretch bond feature at 4000 nm (Figs. 5 and 7).
4. Conclusions
those seen in McDowell et al. (2012a), though in their study they did
not observe peaks within the C`C spectral feature.
3.3.3. VNIR–MIR
This is the first study to the authors' knowledge that looks at the impacts on C modelling resulting from fusing the VNIR and MIR spectral
regions into a single spectrum. In creating a combined spectrum it becomes possible to see the role of the fundamental versus overtone features in the chemometric models. Both the PLSR and RF models show
that the wavelengths contributing most to their respective models are
all located above 2000 nm with little contribution from the visible
through to near infrared region below 2000 nm.
Most chemometric studies of soil C modelling have focused on generating models to estimate total or organic C concentrations. In this
study, by contrast, we have shown that chemometric modelling is useful
to accurately predict soil C fractions. This approach could contribute towards monitoring C for ecological observatories, C trading, assessing
disturbance impacts, as well as assessing the effects of global climate
and land use change on soils. Total C, SOC and RC are all highly correlated in Florida soils. We have shown that models generated for TC and
SOC are in most instances nearly identical. Although RC is highly correlated to TC and SOC the models generated for RC produce slightly poorer
error metrics, and particularly when modelling using PLSR, the RC
model loadings differ from those generated in the TC and SOC models.
Fig. 2. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the VNIR spectral range for each of the carbon fractions.
Highlighted on the plots are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999;
Manchanda et al., 2002; Stuart, 2004).
236
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Fig. 3. Importance values for the selected random forest models in the VNIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption
features related to carbon containing compounds that coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004).
Deviation of HC in Florida in terms of the correlation, error metrics and
output model loadings indicates that this C fraction warrants development of its own individual chemometric models.
The two broad aims of this study were firstly to perform a comparison between the VNIR and MIR spectral regions and assess their ability
to model and estimate the different fractions of soil C, and secondly to
evaluate the performance of two different chemometric modelling
methods in modelling and estimating these soil C fractions. Conclusions
based on these aims are highlighted below.
4.1. Spectral region comparison for modelling soil carbon fractions
We have found that the MIR region produces slightly better results
than the VNIR region, which is similar to earlier findings presented in
Bellon-Maurel and McBratney (2011). In addition, we have found that
in a combined spectrum the region above 2000 nm contributes most
to modelling of C fractions. If RF modelling is used it should be possible
to model C fractions using only the spectral regions between 2000 and
6000 nm.
Based on this finding we ran an RF analysis on our dataset to determine if the spectral region from 2000 to 6000 nm will in fact enable
modelling of the different soil carbon fractions with a comparable accuracy to that of the entire combined spectrum. In Table 5 we provide the
validation findings from this analysis. This region provided comparable
results to using the entire VNIR–MIR combined spectrum and confirms
that this spectral region alone can be used to predict soil carbon fractions. Through this spectral fusion approach one is able to balance the
decreased signal to noise ratio (SNR) experienced at the edge of both
spectroradiometric instruments, and capture the spectral behaviour associated with the fundamental overtone region of the spectrum.
Bellon-Maurel and McBratney (2011) concluded through the analysis of multiple studies that models within the VNIR region identified the
range of 1650–2500 nm as the best range for C modelling. This study
found that using either PLS or RF modelling the region above 2000 nm
was most important. We speculate that inclusion of visible spectrum
is most likely location/area specific, as it would relate to colour of the
soil, and its inclusion in models would reduce their ability to generalise
to new areas. The robustness of models in the validations suggests that
models could be applied to new soil spectra to predict soil carbon
Fig. 4. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the MIR spectral range for each of the carbon fractions.
Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999; Manchanda
et al., 2002; Stuart, 2004).
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
237
Fig. 5. Importance values for the selected random forest regression models in the MIR spectral range for each of the carbon fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004). A reduced spectral range is
displayed in this plot to highlight the regions where peaks are seen, the remainder of the MIR spectrum is relatively flat.
fractions, assuming that soil samples were collected from within the
soil-geographic domain covered in this dataset.
4.2. Modelling methods
Models derived from PLSR in the MIR and VNIR–MIR range have
slightly higher RPD values and slightly lower RMSE values compared
to the results obtained from the RF modelling for all C fractions. On
the other hand in the VNIR range, with the exception of HC, RF modelling produced slightly better results in terms of higher RPD and lower
RMSE values (Tables 5 and 6). A one-dimensional interpretation of
these findings would lead one to conclude that PLSR as a modelling
method outperforms RF modelling. However, in considering not only
the error metrics, but also the loading or importance values that both
of these modelling methods produce (Figs. 2–7), evaluation of the
modelling method should be reassessed to take into account the physical properties of soils and specifically soil C.
The importance value plots generated in the RF modelling approach
suggest that this method more specifically targets the physical
properties of the soil that are linked to C rather than the general soil
properties associated with the sample set. This is particularly evident
in the MIR and VNIR–MIR spectral region (Figs. 4–7) where in the
PLSR modelling, wavelengths that fall within the “Si\O\Si stretch”
(N12,000 nm) spectral region have high relative loading contributions
in the produced models. The reduced wavelength specificity seen in
the PLSR models would result in the PLSR models being less suited to
generalise to areas with differing soil properties, e.g. to areas where
soils contain low concentration of quartzite or have higher inorganic C
content.
In this study PLSR produced models with slightly higher error metrics than those obtained from applying RF modelling. In terms of the
modelling procedure, RF has a few advantages over the PLSR modelling
approach, these include no need to transform the values (spectral mean
centering, and log transformation of the C values) prior to modelling,
and the impact on orthogonality amongst the PLS factors (here shown
in the impact on the loading plots when different numbers of latent variables were selected in the TC, SOC and RC models). Based on the similarity of error metrics produced using either method and the points
Fig. 6. Absolute summed loading plots for the identified latent variables from the selected partial least squares regression models in the VNIR–MIR spectral range for each of the carbon
fractions. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that coincide with identified loading peaks (Ben-Dor et al., 1999;
Manchanda et al., 2002; Stuart, 2004).
238
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Fig. 7. Importance values for the selected RF models in the VNIR–MIR spectral range for each of the carbon fractions. Only a subset of spectral region which contains the importance feature
peaks is shown, the remainder of the spectral region showed no peaks. Highlighted are spectral regions of known band absorption features related to carbon containing compounds that
coincide with importance peaks (Ben-Dor et al., 1999; Manchanda et al., 2002; Stuart, 2004).
highlighted above, we recommend the use of RF modelling for chemometric analysis of C fractions.
Acknowledgments
Funding for this study was provided by the project “Rapid Assessment and Trajectory Modeling of Changes in Soil Carbon across a Southeastern Landscape” (USDA–CSREES–NRI grant award #2007-3510718368; Agricultural and Food Research Initiative — National Institute
of Food and Agriculture). We thank for contributions with field work
and laboratory analysis: A. Comerford, N.B. Comerford, E. Azuaje, X.
Dong, S. Moustafa, C.W. Ross, D. Sarkhot, L. Stanley, A. Stoppe, L. Muller,
A. Quidez and X. Xiong. We thank the reviewer for the insightful comments that helped us with improving this manuscript.
References
Belay-Tedla, A., Zhou, X., Su, B., Wan, S., Luo, Y., 2009. Labile, recalcitrant, and microbial
carbon and nitrogen pools of a tallgrass prairie soil in the US Great Plains subjected
to experimental warming and clipping. Soil Biol. Biochem. 41, 110–116.
Bellamy, P.H., Loveland, P.J., Bradley, R.I., Lark, L.R., Kirk, G., 2005. Carbon losses from all
soils across England and Wales 1978–2003. Nature 437, 245–248.
Bellon-Maurel, V., McBratney, A., 2011. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils — critical review and research perspectives. Soil Biol. Biochem. 43, 1398–1410.
Ben-Dor, E., Irons, J.R., Epema, G.F., 1999. Soil reflectance. In: Rencz, A.N. (Ed.), Remote
Sensing for the Earth Sciences: Manual of Remote Sensing volume 3. John Wiley &
Sons, Inc., pp. 111–188.
Biederbeck, V.O., Janzen, H.H., Campbell, C.A., Zentner, R.P., 1994. Labile soil organic matter as influenced by cropping practices in an arid environment. Soil Biol. Biochem. 26
(12), 1647–1656.
Bradford, M.A., Fierer, N., Reynolds, J.F., 2007. Soil carbon stocks in experimental
mesocosms are dependent on the rate of labile carbon, nitrogen and phosphorus inputs to soils. Funct. Ecol. 22 (6), 964–974.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Brown, D.J., 2007. Using a global VNIR soil-spectral library for local soil characterization
and landscape modeling in a 2nd-order Uganda watershed. Geoderma 140, 444–453.
Brown, D.J., Shepherd, K.D., Walsh, M.G., Dewayne Mays, M., Reinsch, T.G., 2006. Global
soil characterization with VNIR diffuse reflectance spectroscopy. Geoderma 132,
273–290.
Cambule, A.H., Rossiter, D.G., Stoorvogel, J.J., Smaling, E.M.A., 2012. Building a near infrared spectral library for soil organic carbon estimation in the Limpopo National Park,
Mozambique. Geoderma 183–184, 41–48.
Chang, C., Laird, D.A., Mausbach, M.J., Hurburgh Jr., C.R., 2001. Near-infrared reflectance
spectroscopy — principal components regression analysis of soil properties. Soil Sci.
Soc. Am. J. 65, 480–490.
Craine, J.M., Gelderman, T.M., 2011. Soil moisture controls on temperature sensitivity of
soil organic carbon decomposition for a mesic grassland. Soil Biol. Biochem. 43,
455–457.
Davidson, E.A., Janssens, I.A., 2006. Temperature sensitivity of soil carbon decomposition
and feedbacks to climate change. Nature 440, 165–173.
Fierer, N., Craine, J.M., McLaughlan, K., Schimel, J.P., 2005. Litter quality and the temperature sensitivity to decomposition. Ecology 86, 320–326.
Ge, Y., Morgan, C.L.S., Grunwald, S., Brown, D.J., Sarkhot, D.V., 2011. Comparison of soil reflectance spectra and calibration models obtained using multiple spectrometers.
Geoderma 161 (3–4), 202–211.
Geladi, P., Kowalski, B.R., 1986. Partial least-squares regression: a tutorial. Anal. Chim.
Acta. 185, 1–17.
Grunwald, S., Thompson, J.A., Boettinger, J.L., 2011. Digital soil mapping and modeling at
continental scales — finding solutions for global issues. Soil Sci. Soc. Am. J. 75 (4),
1201–1213 (SSSA 75th Anniversary Special Paper).
Gu, L., Post, W.M., King, A.W., 2004. Fast labile carbon turnover obscures sensitivity of heterotrophic respiration from soil to temperature: a model analysis. Glob. Biogeochem.
Cycles 18, GB1022 (1–11).
Hu, S., Coleman, D.C., Carroll, C.R., Hendrix, P.F., Beare, M.H., 1997. Labile soil carbon pools
in subtropical forest and agricultural ecosystems as influenced by management practices and vegetation types. Agric. Ecosyst. Environ. 65 (1), 69–78.
Janik, L., Merry, R., Skjemstad, J., 1998. Can mid infrared diffuse reflectance analysis replace soil extractions? Aust. J. Exp. Agric. 38, 681–696.
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2,
18–22.
Manchanda, M.L., Kudrat, M., Tiwari, A.K., 2002. Soil survey and mapping using remote
sensing. Trop. Ecol. 43, 61–74.
McCarty, G.W., Reeves III, J.B., Reeves, V.B., Follett, R.F., Kimble, J.M., 2002. Mid-infrared
and near-infrared diffuse reflectance spectroscopy for soil carbon measurement.
Soil Sci. Soc. Am. J. 66, 640–646.
McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., Knox, N.M., 2012a. Soil total carbon analysis in Hawaiian soils with visible, near-infrared and mid-infrared diffuse reflectance spectroscopy. Geoderma 189–190, 312–320.
McDowell, M.L., Bruland, G.L., Deenik, J.L., Grunwald, S., 2012b. Effects of subsetting by
carbon content, soil order, and spectral classification on prediction of soil total carbon
with diffuse reflectance spectroscopy. J. Appl. Environ. Soil Sci. 2012, 1–14. http://dx.
doi.org/10.1155/2012/294121 (Article ID 294121).
Mevik, B.-H., Cederkvist, H.R., 2004. Mean squared error of prediction (MSEP) estimates
for principal component regression (PCR) and partial least squares regression
(PLSR). J. Chemometr. 18, 422–429.
Metz, B., Davidson, O.R., Bosch, P.R., Dave, R., Meyer, L.A. (Eds.), 2007. Contribution of
Working Group III to the Fourth Assessment Report of the Intergovernmental Panel
on Climate Change, 2007. Cambridge University Press, Cambridge, United Kingdom
and New York, NY, USA.
Mevik, B.-H., Wehrens, R., 2007. The pls package: principal component and partial least
squares regression in R. J. Stat. Softw. 18 (2), 1–24.
R Development Core Team, 2010. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Reeves, J.B., 2010. Near- versus mid-infrared diffuse reflectance spectroscopy for soil analysis emphasizing carbon and laboratory versus on-site analysis: where are we and
what needs to be done? Geoderma 158, 3–14.
Reeves III, J.B., McCarty, G.W., Mimmo, T., 2002. The potential of diffuse reflectance spectroscopy for the determination of carbon inventories in soils. Environ. Pollut. 116,
277–284.
Reeves III, J.B., Follett, R.F., McCarty, G.W., Kimble, J.M., 2006. Can near or mid infrared diffuse reflectance spectroscopy be used to determine soil carbon pools. Commun. Soil
Sci. Plant Anal. 37 (15–20), 2307–2325.
N.M. Knox et al. / Geoderma 239–240 (2015) 229–239
Sarkhot, D.V., Grunwald, S., Ge, Y., Morgan, C.L.S., 2011. Comparison and detection of soil
carbon under Arundo donax and coastal bermuda grass using visible/near infrared diffuse reflectance spectroscopy. Geoderma 164, 22–32.
Seni, G., Elder, J.F., 2010. Ensemble methods in data mining: improving accuracy through
combining predictions, synthesis. Synthesis Lectures on Data Mining and Knowledge
Discovery. Morgan & Claypool Publishers.
Shepherd, K.D., Walsh, M.G., 2002. Development of reflectance spectral libraries for characterization of soil properties. Soil Sci. Soc. Am. J. 66, 988–998.
Stuart, B.H., 2004. Infrared Spectroscopy: Fundamentals and Applications. John Wiley &
Sons, Inc.
Thornley, J.H.M., Cannell, M.G.R., 2001. Soil carbon storage response to temperature: an
hypothesis. Ann. Bot. (Lond.) 87 (5), 591–598.
Trumbore, S.E., Chadwick, O.A., Amundson, R., 1996. Rapid exchange between soil carbon
and atmospheric carbon dioxide driven by temperature change. Science 272,
393–396.
Udelhoven, T., Emmerling, C., Jarmer, T., 2003. Quantitative analysis of soil chemical properties with diffuse reflectance spectrometry and partial least-square regression: a feasibility study. Plant Soil 251 (2), 319–329.
Vasques, G.M., Grunwald, S., Sickman, J.O., 2008. Comparison of multivariate methods for
inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma
146, 14–25.
239
Vasques, G.M., Grunwald, S., Sickman, J.O., 2009. Visible/near-infrared spectroscopy
modeling of dynamic soil carbon fractions. Soil Sci. Soc. Am. J. 73, 176–184.
Vasques, G.M., Grunwald, S., Harris, W.G., 2010. Building a spectral library to estimate soil
organic carbon in Florida. J. Environ. Qual. 39, 923–934.
Vasques, G.M., Grunwald, S., Myers, D.B., 2012. Multi-scale behavior of soil carbon at
nested locations in Florida, USA. Landsc. Ecol. 27, 355–367. http://dx.doi.org/10.
1007/s10980-011-9702-3.
Viscarra Rossel, R.A., Behrens, T., 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 158 (1–2), 46–54.
Viscarra Rossel, R.A., Walvoort, D.J.J., McBratney, A.B., Janik, L.J., Skjemstad, J.O., 2006. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 131, 59–75.
Vohland, M., Besold, J., Hill, J., Fründ, H.-C., 2011. Comparing different multivariate calibration methods for the determination of soil organic carbon pools with visible to near
infrared spectroscopy. Geoderma 166 (1), 198–255.
Wehrens, R., 2011. Chemometrics with R. In: Gentleman, R., Hornik, K., Pamigiani, G.
(Eds.), Use R! Springer-Verlag, Heidelberg.
Williams, P.C., 1987. Variables affecting near-infrared reflectance spectroscopic analysis.
In: Williams, P., Norris, K. (Eds.), Near-infrared Technology in the Agricultural and
Food Industries. American Association of Cereal Chemists, St. Paul, MN, pp. 143–167.
Fly UP