# Bootstrapping the Cate-Nelson

Getting confidence intervals on the hard splits.
agriculture
statistics
modeling
soil chemistry
soil testing
soiltestcorr
R
Published

January 30, 2023

Modified

January 30, 2023

The Cate-Nelson (1971) is a method used to estimate the critical soil test value among a group of sites. This method does not provide a response curve, but rather displays two hard lines, one horizontal and one vertical. Sites to the left of the vertical line are generally expected see crop responses to fertilizer. These quadrant lines can be established statistically (minimizing/maximizing certain stats) or at a fixed, target Y value.

But what about the uncertainty around these line? What would it look like to calculate bootstrap confidence intervals for the critical soil test value?

My approach presented here is based on bootstrap confidence intervals, which has a tidyverse accent. You can also read about bootstraps from the `nlraa` package. For Cate-Nelson1 in R, the `soiltestcorr` and `rcompanion` packages are your main options.

• 1 Also, I might just call the Cate-Nelson “CN” at some points.

``````library(tidyverse) # for data manipulation and plotting and so on
library(tidymodels)# for case resampling
library(devtools)  # for sourcing stuff from GitHub
library(soiltestcorr)
library(rcompanion)
library(ggridges) # to make cool distributions later
library(ggpointdensity) # to make cool points later``````

## The Data

Let’s bring in some pseudo-fake data.

``````corr_data <-
mutate(y = y * 100) # percentage scale

plot_corr <- corr_data %>%
ggplot(aes(x, y)) +
geom_hline(yintercept = 100, alpha = 0.2) +
geom_point(alpha = 0.3) +
geom_rug(alpha = 0.5, length = unit(2, "pt")) +
scale_x_continuous(breaks = seq(0, 300, 20)) +
scale_y_continuous(limits = c(0, 120),
breaks = seq(0, 120, 10)) +
labs(title = paste(
"Relative yield and soil test concentration for", nrow(corr_data), "fertilizer trials"),
y = "Relative crop yield (%)",
x = "Soil Test (mg/kg)")

plot_corr`````` In this soil test correlation dataset, more nutrient measured in the soil should correlate to more crop yield. The relationship does not look linear (as expected) and is somewhat noisy. The Cate-Nelson procedure attempts to cut through this noise.

## Doing the Cate-Nelson Once

Let’s look at the critical value by running the Cate-Nelson each way.

### Fixed Y value

``soiltestcorr::cate_nelson_1965(corr_data, x, y, target = 95, plot = TRUE)`` ``soiltestcorr::cate_nelson_1965(corr_data, x, y, target = 95)\$CSTV``
`` 97``
``````rcompanion::cateNelsonFixedY(corr_data\$x, corr_data\$y,
cly = 95, plotit = FALSE) |> head()``````
``````  Critx Crity Q1 Q2 Q3 Q4 Model Error   N   pQ1   pQ2   pQ3   pQ4 pModel pError
1    96    95 13 94 35 40   134    48 182 0.071 0.516 0.192 0.220  0.736  0.264
2    96    95 13 94 35 40   134    48 182 0.071 0.516 0.192 0.220  0.736  0.264
3    98    95 13 94 35 40   134    48 182 0.071 0.516 0.192 0.220  0.736  0.264
4    94    95 12 95 37 38   133    49 182 0.066 0.522 0.203 0.209  0.731  0.269
5    94    95 12 95 37 38   133    49 182 0.066 0.522 0.203 0.209  0.731  0.269
6    94    95 12 95 37 38   133    49 182 0.066 0.522 0.203 0.209  0.731  0.269
Fisher.p Pearson.chisq Pearson.p    phi
1 2.293e-09         34.27 4.809e-09 -0.446
2 2.293e-09         34.27 4.809e-09 -0.446
3 2.293e-09         34.27 4.809e-09 -0.446
4 7.920e-09         32.49 1.197e-08 -0.435
5 7.920e-09         32.49 1.197e-08 -0.435
6 7.920e-09         32.49 1.197e-08 -0.435``````

This approach estimated 97 mg kg-1 as the critical value to achieve 95% of the potential yield.

### Free

``soiltestcorr::cate_nelson_1971(corr_data, x, y, plot = TRUE)``
``````Warning in stats::chisq.test(data.frame(row.1, row.2)): Chi-squared
approximation may be incorrect`````` ``soiltestcorr::cate_nelson_1971(corr_data, x, y)\$CSTV``
``````Warning in stats::chisq.test(data.frame(row.1, row.2)): Chi-squared
approximation may be incorrect``````
`` 47.5``
``````rcompanion::cateNelson(corr_data\$x, corr_data\$y, trend = "positive",
plotit = FALSE, verbose = FALSE, progress = FALSE)``````
``````    n  CLx       SS   CLy Q.I Q.II Q.III Q.IV Q.Model  p.Model Q.Error
1 182 47.5 8979.932 68.25   5  169     1    7     176 0.967033       6
p.Error Fisher.p.value Cramer.V
1 0.03296703   5.290161e-09   0.6991``````

This approach estimated just 47.5 mg kg-1 as the critical value, but the relative yield is only 68%.

## Doing the Cate-Nelson Over and Over: Option 1

Now to take this and put it into a framework for bootstrapping the tidy way. First, make a function that can fit the Cate-Nelson (CN) on the analysis portion of the data split created with the `rsample` package2. Then create the bootstraps, and fit the CN model to each. But what if a model fails? The `purrr` package includes `possibly()` which I’ve used to produce a NULL result for some reason the CN fitting fails3.

• 2 I sort of like adding the `map(splits, analysis)` step to pull out the analysis data from the split and save those resamplings in a new `data` column, just to prove that the data was really randomly sampled.

• 3 Probably won’t but I did it for the `nls()` functions previously.

• The following is for the Cate-Nelson with a fixed RY target.

``````fit_cn <- function(data, target){
cstv <- cateNelsonFixedY(
x = data\$x,
y = data\$y,
cly = target,
plotit = FALSE,
trend = "positive")[1, 1]
return(cstv)
}

# Must set seed to reproduce the same results every time
n_reps <- 1000 # sometimes denoted as "R"

set.seed(8675309)
cn_bootstraps <-
corr_data %>%
bootstraps(times = n_reps) %>%
mutate(data = map(splits, analysis),
ry = 95, # target
cstv = map2_dbl(data, ry, possibly(fit_cn, otherwise = NULL))) |>
select(-splits)

cn_bootstraps``````
``````# A tibble: 1,000 × 4
id            data                  ry  cstv
<chr>         <list>             <dbl> <dbl>
1 Bootstrap0001 <tibble [182 × 3]>    95   113
2 Bootstrap0002 <tibble [182 × 3]>    95    60
3 Bootstrap0003 <tibble [182 × 3]>    95    96
4 Bootstrap0004 <tibble [182 × 3]>    95    96
5 Bootstrap0005 <tibble [182 × 3]>    95    96
6 Bootstrap0006 <tibble [182 × 3]>    95   110
7 Bootstrap0007 <tibble [182 × 3]>    95   103
8 Bootstrap0008 <tibble [182 × 3]>    95    71
9 Bootstrap0009 <tibble [182 × 3]>    95   124
10 Bootstrap0010 <tibble [182 × 3]>    95    96
# ℹ 990 more rows``````

### Confidence Intervals

Next, we can get a confidence interval for `cstv` with `quantile()`, which in this case is set to 90%.

``````ci_90 <- quantile(cn_bootstraps\$cstv, c(0.05, 0.95), na.rm = TRUE) |> round()

ci_90``````
`````` 5% 95%
70 125 ``````

Pretty cool.

But what does it all look like?!

## Visualization

What does the distribution of the critical soil test value look like?

``````cn_bootstraps %>%
ggplot(aes(cstv, y = 0,
fill = factor(after_stat(quantile)))) +
ggridges::stat_density_ridges(
quantile_lines = TRUE,
quantiles = c(0.05, 0.5, 0.95)) +
scale_color_manual(values = c("#CE1141", "#A0A0A088")) +
scale_fill_manual(
name = "Probability",
values = c("#CE1141", "#A0A0A088", "#A0A0A088", "#13274F"),
labels = c("(0, 0.05]", "(0.05, 0.5)", "(0.5, 0.95)", "(0.95, 1]")
) +
scale_x_continuous(breaks = seq(40, 160, 10)) +
labs(y = NULL, x = "Estimated Critical Value",
title = "Bootstrap distribution with median and 90% confidence interval") +
theme(axis.text.y = element_blank())``````
``Picking joint bandwidth of 3.21``
``````Warning: Using the `size` aesthetic with geom_segment was deprecated in ggplot2 3.4.0.

What do the estimates look like in the context of the original data?

``````plot_corr +
geom_pointdensity(data = cn_bootstraps,
aes(cstv, ry),
alpha = 0.5,
size = 2) +
scale_color_viridis_c(option = "H")``````

The critical soil test value for 95% RY appears to be 97 mg kg-1, and the 90% confidence interval is from 70 to 125 mg kg-1. Of concern, however, is the multimodality of the distribution. This seems to be a trait of the uncertainty around the critical value of the Cate-Nelson, at least when RY is fixed.

What if we employed the 1971 method without a fixed RY?

## Doing the Cate-Nelson Over and Over: Option 2

The first thing you’ll notice is how much slower this option is because we’re allowing iterations over the Y-values in addition to the X-values. So I’m cutting the bootstraps down considerably for demonstration purposes. Also, I’m showing another option for getting out results with `unnest()`. Instead of getting a single numeric value for which we would map to a double

``````fit_cn_71 <- function(data){
cstv <- cateNelson(
x = data\$x,
y = data\$y,
plotit = FALSE,
trend = "positive",
verbose = FALSE,
progress = FALSE)
return(cstv)
}

cn_71_cstv <- fit_cn_71(corr_data)\$CLx |> round()

# Must set seed to reproduce the same results every time
n_reps <- 100 # sometimes denoted as "R"

set.seed(8675309)
cn_bootstraps <-
corr_data %>%
bootstraps(times = n_reps) %>%
mutate(data = map(splits, analysis),
model = map(data, possibly(fit_cn_71, otherwise = NULL))) |>
unnest(model) |>
rename(cstv = CLx,
ry = CLy)

mean_ry <- mean(cn_bootstraps\$ry) |> round(1)``````

And now to reuse the previous code:

``````ci_90 <- quantile(cn_bootstraps\$cstv, c(0.05, 0.95), na.rm = TRUE) |> round()
ci_90_ry <- quantile(cn_bootstraps\$ry, c(0.05, 0.95), na.rm = TRUE) |> round(1)

ci_90``````
`````` 5% 95%
47  95 ``````
``ci_90_ry``
``````  5%  95%
67.9 85.3 ``````
``````cn_bootstraps %>%
ggplot(aes(cstv, y = 0,
fill = factor(after_stat(quantile)))) +
ggridges::stat_density_ridges(
quantile_lines = TRUE,
quantiles = c(0.05, 0.5, 0.95)) +
scale_color_manual(values = c("#CE1141", "#A0A0A088")) +
scale_fill_manual(
name = "Probability",
values = c("#CE1141", "#A0A0A088", "#A0A0A088", "#13274F"),
labels = c("(0, 0.05]", "(0.05, 0.5)", "(0.5, 0.95)", "(0.95, 1]")
) +
scale_x_continuous(breaks = seq(40, 160, 10)) +
labs(y = NULL, x = "Estimated Critical Value",
title = "Bootstrap distribution with median and 90% confidence interval") +
theme(axis.text.y = element_blank())``````
``Picking joint bandwidth of 5.21`` What do the estimates look like in the context of the original data?

``````plot_corr +
geom_pointdensity(data = cn_bootstraps,
aes(cstv, ry),
alpha = 0.5,
size = 2) +
scale_color_viridis_c(option = "H")``````

The critical soil test value is 48 mg kg-1, and the 90% confidence interval is from 47 to 95 mg kg-1. The associated relative yield with the critical values is 73.4 percent (90% CI from 67.9 to 85.3 percent). Again, check for multimodality in the distributions.

## References

Cate, Robert B., and Larry A. Nelson. 1971. “A Simple Statistical Procedure for Partitioning Soil Test Correlation Data into Two Classes.” Soil Science Society of America Journal 35 (4): 658–60. https://doi.org/10.2136/sssaj1971.03615995003500040048x.