Bootstrap confidence intervals on a linear-plateau model in R

Crop yields go up a line, then level off at some uncertain point.

agriculture

statistics

modeling

soil chemistry

soil testing

soiltestcorr

R

Published

July 26, 2022

Modified

July 26, 2022

This post is about estimating uncertainty on a nonlinear model parameter or outcome. In the context of soil test correlation, this parameter or model outcome could represent a critical soil test value.

library(agridat) # for the datalibrary(tidyverse) # for data manipulation and plotting and so onlibrary(nlraa) # for self-starting models

Warning: S3 method 'print.boot' was declared in NAMESPACE but not found

library(minpack.lm)# for nls with Levenberg-Marquardt algorithmlibrary(rsample) # for bootstrapslibrary(nlstools) # for residual plotslibrary(modelr) # for the r-squared and rmselibrary(devtools) # for sourcing the lin_plateau() function in Method 3

The Data

The dataset Iâ€™ll use can be accessed through the agridat package. In this soil test correlation dataset, more potassium measured in the soil should correlate to more cotton yield, but the relationship will most likely not be linear. Letâ€™s plot it to see what weâ€™re working with.

The relationship between soil test potassium (STK) and relative cotton yield is not linear. Perhaps a polynomial function could be fit, or the data could be transformed, but weâ€™ll fit a nonlinear model known as the linear-plateau (LP), or lin-plat^{1}. Looking at the plot, notice how the relative yield of the cotton does not appear to increase anymore from about 70-200 ppm K (the plateau!). Scroll to the bottom of this page if you struggle to visualize this.

^{1} This isnâ€™t true, but I wish it was.

Basic method of fitting LP

Letâ€™s make a function for the LP model to be used with nls() in base R.

# a = intercept# b = slope# cx = critical X value = join/break point# lp <-function(x, a, b, cx) {if_else(condition = x < cx,true = a + b * x,false = a + b * cx)}

This function says that up until some join point, the relationship is linear, after which the stick breaks and it is a flat line (plateau). Using the nls() function in base R requires that we provide starting values for each of the model parameters. Starting values are critical, and the computer doesnâ€™t really know where to start, so poor starting values may result in failures of the model to converge as the algorithms working their tails off behind the scenes try to home in on the best values. It helps, though, to keep in mind that with this sort of data we can expect:

the intercept will not be greater than the maximum Y value

the slope will be positive

the join point will not be less than your minimum X value (and, ideally, will be less than than your maximum^{2}).

^{2} If the join point is equal to or greater than your maximum X value, then you have not fitted a LP model. You have fitted a regular, olâ€™ fashioned, straight line.

fit <-nls(formula = y ~lp(x, a, b, jp),data = cotton,start =list(a =30,b =1,jp =70))

Error in nls(formula = y ~ lp(x, a, b, jp), data = cotton, start = list(a = 30, : step factor 0.000488281 reduced below 'minFactor' of 0.000976562

summary(fit)

Error in summary(fit): object 'fit' not found

And it failed^{3}. If we try running this again but include trace = TRUE as an additional argument, we see that nls() got close to finishing, but it could be it failed to converge because the data is too noisy/gappy around the values we are guessing the join point.

^{3} What if the starting values result in a failure of converge? Check out these tips from Dr. Miguez.

Before trying another method to get starting values, letâ€™s try nlsLM() first from the minpack.lm package, which uses a different algorithm.

LP_mod <- minpack.lm::nlsLM(formula = y ~lp(x, a, b, cx),data = cotton,start =list(a =30,b =1,cx =70))summary(LP_mod)

Formula: y ~ lp(x, a, b, cx)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 39.8388 9.4844 4.200 0.000402 ***
b 0.7408 0.2077 3.567 0.001822 **
cx 75.0000 10.8467 6.915 7.85e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.06 on 21 degrees of freedom
Number of iterations to convergence: 17
Achieved convergence tolerance: 1.49e-08

Worked fine. Now to take this and put it into a framework for bootstrapping the tidy way. First, make a function that can fit the LP on the analysis portion of the data split created with the rsample package. Then create the bootstraps, and fit the LP model to each. But what if a model fails? The purrr package includes possibly() which Iâ€™ve used to produce a NULL result if the nls() step fails to converge.

fit_LP <-function(split) { fit <-nlsLM(formula = y ~lp(x, a, b, cx),data =analysis(split),start =as.list(coef(LP_mod)))return(fit)}# Must set seed to reproduce the same results every timeset.seed(911)lp_bootstraps <- cotton %>%bootstraps(times =100) %>%mutate(models =map(splits, possibly(fit_LP, otherwise =NULL)))lp_bootstraps

Next, we have two pathways to getting the 95% confidence interval for the critical X parameter (cx). For each, weâ€™ll need to get the model coefficients with broom::tidy(). Then we can unnest the coefs, pivot, and calculate the CI with quantile(), OR we can skip the unnesting and use the nifty interval percentile function int_pctl() from rsample.

Warning: Recommend at least 1000 non-missing bootstrap resamples for terms: `a`,
`b`, `cx`.

# A tibble: 3 Ã— 6
term .lower .estimate .upper .alpha .method
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a -6.37 30.3 57.1 0.05 percentile
2 b 0.398 1.01 1.88 0.05 percentile
3 cx 50.0 71.8 104. 0.05 percentile

Pretty cool. Did you notice the warning message? With bootstrap resampling techniques, more is better (at least to a point, and depending on your patience). The 1000 mentioned in the warning message is a standard recommendation; my code only used 100. So letâ€™s see if/how our 95% confidence intervals change if we doâ€¦.2000!

# A tibble: 3 Ã— 6
term .lower .estimate .upper .alpha .method
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 a -9.26 29.7 55.2 0.05 percentile
2 b 0.426 1.02 2.15 0.05 percentile
3 cx 49.0 71.4 102. 0.05 percentile

So the confidence intervals for the cx parameter didnâ€™t change much.

Bootstrap confidence intervals with a plot!

Iâ€™ve added these bootstrap steps into the lin_plateau() function described in my other post. Letâ€™s get that function.

devtools::source_url("https://raw.githubusercontent.com/austinwpearce/SoilTestCocaCola/main/lin_plateau.R")# run `lin_plateau` without parentheses in R console to see source code

By using the confint = TRUE argument, we can get an estimate of the uncertainty around the join point parameter of the linear-plateau. As a default, only 500 bootstrap replicates (boot_R) are used, for a quick estimate of the 95% CI. Bump that up to 1000 or more at your pleasure.

lin_plateau(cotton, x, y, confint =TRUE)

Warning: Recommend at least 1000 non-missing bootstrap resamples for terms: `a`,
`b`, `cx`.

lin_plateau(cotton, x, y, confint =TRUE, plot =TRUE)

Warning: Recommend at least 1000 non-missing bootstrap resamples for terms: `a`,
`b`, `cx`.

The critical value for soil test K appears to be 75 mg kg^{-1} K, with a 95% CI from about 49 to 105 mg kg^{-1} K. Now why do we go to all this trouble with bootstrap confidence intervals? Because we cannot assume that the uncertainty around the join point is normally distributed. This could be visualized by plotting all the join points from all the bootstraps (example from another dataset below). With nonlinear models, the distribution of a parameter is often not normal.