Featured image of post Fit linear-plateau models in R

Fit linear-plateau models in R

Agricultural yields go up a line, then level off.

How does one fit a linear-plateau model in R?

This post is intended for the person1 who has been googling how to fit linear-plateau models in R. This post demonstrates three ways to fit a linear-plateau model, and presents an unnecessary unique function that sort of helps one do linear-plateau fits and make a ggplot style figure all at once. So first, load some packages.

library(tidyverse) # for data manipulation and plotting and so on
library(nlraa) # for self-starting models
library(minpack.lm)# alternative for nls model making
library(broom)     # for getting model info in a tidy way
library(modelr)    # for the r-squared and rmse
library(nlstools)  # for Wald confidence intervals
library(agridat)   # for the data

The data

The data is from some cotton trials in the agridat package. Basically, more potassium measured in the soil should correlate to more cotton yield, but the relationship will not be linear. Fitting a model to determine something about this relationship is central to the soil-test correlation process. Let’s plot it.

cotton <- tibble(x = agridat::cate.potassium$potassium,
                 y = agridat::cate.potassium$yield)

cotton %>% 
  ggplot(aes(x, y)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_rug(alpha = 0.5, length = unit(2, "pt")) +
  scale_x_continuous(breaks = seq(0, 1000, 20)) +
  scale_y_continuous(breaks = seq(0, 200, 10)) +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")

The relationship between soil-test potassium (STK) and relative cotton yield may be nonlinear2. Perhaps a polynomial function could be fit, or the data could be transformed, but we’ll fit a non-linear model known as the linear-plateau. Linear-plateau (LP) models are also called lin-plats3. Lin-plats are good in that they are simpler than quadratic-plateaus and have a well-defined join point. Notice how the relative yield of the cotton does not appear to increase anymore from about 70-200 ppm K (the plateau!).

The LP function

The equation for the LP is a line up until the join point: y = B0 + B1x. While we could define the LP equation in the nls() function later, let’s make a function now.

# b0 = intercept
# b1 = slope
# jp = join point = critical concentration
lp <- function(x, b0, b1, jp) {
      condition = x < jp,
      true = b0 + b1 * x,
      false = b0 + b1 * jp)

This function says that up until some join-point, the relationship is linear, after which the stick breaks and it is a flat line (plateau). Sometimes this model is called broken-stick or linear-plus-plateau. That join-point is important; it represents a critical soil test value for the nutrient in the soil. The join point of a linear-plateau will always be lower than with a quadratic-plateau, an important consideration for another time and place.

Method 1: base R

Using the nls() function in base R requires that we provide the formula (in this case the LP function), the data, and starting values. Starting values are critical, and the computer doesn’t really know where to start, so poor starting values may result in failures of the model to converge as the algorithms working their tails off behind the scenes try to home in on the best values. It helps, though, to know that 1) the slope will be positive and 2) the join point will certainly not be less than your minimum soil test value (and ideally not more than your maximum).

Get starting values by guessing

Eyeballing that scatter plot, I’m trying to visualize in my mind’s eye a straight line coming up and breaking over at a join point. The join point is easiest to imagine, and I’d guess it’s somewhere between 60 and 80 ppm K. Going to the left, that imaginary line looks like it might intercept the y-axis around 20-30, maybe? The slope of that line would be positive, and looks like maybe it’s around 2-ish (2% RY gain for every 1 ppm increase).

nls(formula = y ~ lp(x, b0, b1, jp),
    data = cotton,
    start = list(b0 = 20,
                 b1 = 2,
                 jp = 70))
## Error in nls(formula = y ~ lp(x, b0, b1, jp), data = cotton, start = list(b0 = 20, : step factor 0.000488281 reduced below 'minFactor' of 0.000976562

And it failed4. It may be that nls() is being too picky, but it could be that out data is a bit noisy/gappy around the values we are guessing the join point would be.

Before trying another method to get starting values, let’s try nlsLM() first from the minpack.lm package, which uses a fancier algorithm.

nlsLM(formula = y ~ lp(x, b0, b1, jp),
      data = cotton,
      start = list(b0 = 20,
                   b1 = 2,
                   jp = 70))
## Nonlinear regression model
##   model: y ~ lp(x, b0, b1, jp)
##    data: cotton
##      b0      b1      jp 
## 39.8156  0.7411 75.0000 
##  residual sum-of-squares: 2569
## Number of iterations to convergence: 19 
## Achieved convergence tolerance: 1.49e-08

And it worked! The response equation is y = 39.8 + 0.74x up to 75 ppm K.

Get starting values by fitting a line

If you’re a sane person, you might decide to get starting values for a LP model by fitting a regular ol’ line. From this the intercept and slope can be estimated, and the join point can then be estimated by taking the mean or median of the soil test values. In fact, we could wrap all of these steps into a function.

fit_lp <- function(data) {
    # get starting values for nls
    start <- lm(y ~ x, data = data)
    # make model
    fit <- nlsLM(formula = y ~ lp(x, b0, b1, jp),
                 data = data,
                 start = list(b0 = start$coef[[1]],
                              b1 = start$coef[[2]],
                              jp = mean(data$x)))

## Nonlinear regression model
##   model: y ~ lp(x, b0, b1, jp)
##    data: data
##     b0     b1     jp 
## 19.345  1.331 56.000 
##  residual sum-of-squares: 2411
## Number of iterations to convergence: 16 
## Achieved convergence tolerance: 1.49e-08

It worked again, but notice that the response equation now is 19.3 + 1.33x up to 56 ppm K. Quite a bit different perhaps just by using different starting values. Using a self-starting function that uses an optimization algorithm to get starting values might be a more precise and reliable way to go. Let’s try that.

Method 2: nlraa

Better than providing starting values to nls or nlsLM is to use a function that can “self-start”. The nlraa package provides just that, a self-starting function for a linear-plateau model.

fit_SSlp <- function(data) {
    # nlraa model
    fit <- nlsLM(
        formula = y ~ SSlinp(x, b0, b1, jp),
        data = data)

## Nonlinear regression model
##   model: y ~ SSlinp(x, b0, b1, jp)
##    data: data
##      b0      b1      jp 
## 39.2364  0.7518 75.0000 
##  residual sum-of-squares: 2569
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 1.49e-08

Flawless. This function optimizes the starting values through some sort of searching 5, and the resulting equation is y = 39.2 + 0.75x up to 75 ppm K, which is similar to the first.

Method 3: Convoluted way

Next is my lin_plateau()6 function that fits the LP model and either gets useful numbers into a table or plots it visually. Getting results into a table means that this function is compatible with tidyverse/tidymodels style work, usable on grouped data frames and so on. Getting results into a plot could be done with a separate function, but why not cram it all into one? Here are the steps of the function:

  1. Use a self-starting function within nlsLM().

  2. Test goodness-of-fit with R2, RMSE, and model error.

  3. Get model coefficients for response curve equation.

  4. Get a confidence interval for the join-point (critical soil test value).

  5. Optional: look at standard error and p-values for the coefficients, though these may not matter much.

  6. Optional: get the soil test value at some point below the join-point, such as 95% of the maximum yield plateau.

  7. Create a table of results.


  8. Plot the cotton yield data with a regression line using nlsLM and the model final values as starting values (which all ensure identical results).

# run model and graph all in one!
lin_plateau <- function(data,
                        confidence = 95,
                        alpha = 0.05,
                        percent_of_max = 95,
                        plot = FALSE) {
    # linear plateau model
    corr_model <- fit_SSlp(data)

     # Find p-value and pseudo R-squared
    rsquared <- round(modelr::rsquare(corr_model, data), 2)
    rmse <- round(modelr::rmse(corr_model, data), 2)
    model_error <- round(summary(corr_model)$sigma, 2)
    # get wald confidence interval of join point at 100% 
    ci <- nlstools::confint2(corr_model, level = confidence / 100)
    lower_cl <- round(ci[3, 1], 0)
    upper_cl <- round(ci[3, 2], 0)
    # get model coefficients
    b0 <- tidy(corr_model)$estimate[1]
    b1 <- tidy(corr_model)$estimate[2]
    jp <- tidy(corr_model)$estimate[3]
    # CSTV at defined % of max/plateau (use percent_of_max argument)
    plateau <- b0 + b1 * jp
    cstv_adj <- round(((plateau * percent_of_max / 100) - b0) / b1, 0)
    # have to make a line because the SSlinp doesn't plot right
    lp_line <- tibble(x = c(min(data$x), jp, max(data$x)),
                      y = c(b0 + b1 * min(data$x), plateau, plateau))
    equation <- paste0(round(b0, 1), " + ",
                       round(b1, 2), "x")
    # Table output =====
    if (plot == FALSE) {
            plateau = round(plateau, 1),
            join_point = round(jp, 0),
            AIC = round(AIC(corr_model),0)
    } else {
        # Residual plots and normality
        nls_resid <- nlstools::nlsResiduals(corr_model)
        plot(nls_resid, which = 0)
        predicted <- nlraa::predict2_nls(corr_model,
                                 newdata = data,
                                 interval = "confidence") %>%
            as_tibble() %>% 
        predicted %>%
            ggplot(aes(x, y)) +
            geom_ribbon(aes(y = Estimate,
                            ymin = Q2.5,
                            ymax = Q97.5),
                            alpha = 0.05) +
            ggpp::geom_vhlines(xintercept = jp,
                               yintercept = plateau,
                               alpha = 0.2,
                               linetype = 3) +
            ggpp::geom_vhlines(xintercept = cstv_adj,
                               yintercept = plateau * percent_of_max / 100,
                               alpha = 0.3,
                               linetype = 3) +
                label = paste0("y = ", equation,
                    "\nr-squared = ", rsquared,
                    "\nrmse = ", rmse,
                    "\nCSTV at ", round(plateau, 1), "% is ", 
                    round(jp, 0)," ppm (CI ",lower_cl,"-",upper_cl,")",
                    "\nAdjusted CSTV at ", percent_of_max, "% of max = ", cstv_adj, " ppm"),
                x = max(data$x) * 0.7,
                y = quantile(data$y, 0.10),
                hjust = 0,
                family = rc) +
            geom_point(size = 2, alpha = 0.5) +
            geom_rug(alpha = 0.2, length = unit(2, "pt")) +
            geom_line(data = lp_line,
                      aes(x = x, y = y),
                      color = "#CC0000") +
            labs(x = "Soil test K (ppm)",
                 y = "Relative yield (%)",
                 caption = paste("Each point is a site. n =", nrow(data)))

Convoluted result

So how do you fit a linear-plateau model in R? Like so:

## # A tibble: 1 x 9
##   equation plateau join_point lower_cl upper_cl rsquared  rmse model_error   AIC
##   <chr>      <dbl>      <dbl>    <dbl>    <dbl>    <dbl> <dbl>       <dbl> <dbl>
## 1 39.2 + ~    95.6         75       46      104     0.66  10.4        11.1   188

The model is written out in a data frame, ready for other use and manipulation. This is really helpful if for example you have a grouped data frame of different fertilizer trials, and want to perform an LP fit to each group using something like group_modify().

If you want to plot it along with an overlay of model results, use the plot = TRUE argument and add your labels. Use group_map() if you want to “loop” this over multiple groups in your data frame.

lin_plateau(cotton, plot = TRUE)

The critical soil concentration for potassium appears to be around 75 ppm K. The Wald confidence interval is pretty wide (46-104), though not as wide if fitting a quadratic-plateau. This problematic width is partially due to getting the confidence limits at the join point, rather than at a lower point such as 99 or 95% of the maximum yield plateau. But in the case of this dataset, I’d say the problem has to do with data availability. There are gaps in the data, including a hole between 40-80 ppm STK and 65-90% RY. Basically the model lacks confidence in the fit approaching the join point, and this is reflected in the wide gray RY confidence band shown around the red line in the plot. The experimenter needs more data in that gap.

Now, what if the two points with “high” STK (> 180 ppm), were removed? How would that change the analysis? Go back and test yourself7. How do other models compare? Quadratic-plateau? Mitscherlich-Bray?

More resources about nonlinear functions in R

  • rcompanion package - gave me a good jumpstart and has some useful functions inside

  • nlstools package

  • nlraa package by Fernando Miguez

  • easynls package - it’s quick and dirty, but not as useful as it could be

  • nls.multstart package - tries lots of starting values to try and better get at the best values for the equation. But I think for QP models, my approach works fine and using multiple starting values is unnecessary.

Bonus: Quick figure with LP line

What if you just want a quick figure of an LP fitted to your scatter plot points, and do not care so much about the model results and information, etc? Using the self-starting function from nlraa, try the following:

cotton %>% 
  ggplot(aes(x, y)) +
  geom_point(size = 2,
             alpha = 0.5) +
            method = "nlsLM",
            formula = y ~ SSlinp(x, b0, b1, jp),
            se = FALSE,
            color = "#CC0000") +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")

Congratulations! You are now certified to lin-plat in R.

  1. nlsLM() is more reliabel overall than plain nls(). I’ve had pretty good luck with nls(), but sometimes it fails. nlsLM() seems to fail less. You can test their similarity by using a, b, c coefficients in the output of nlsLM() as inputs to nls().↩︎

  2. Hint: filter()↩︎

  3. This isn’t true, but I wish it was.↩︎

  4. What if the starting values result in a failure of converge? Check out these tips from Dr. Miguez.↩︎

  5. The trade-off is that it takes a little longer to run than my fit_lp function. You probably won’t notice a difference unless you are running it 1000 times in a bootstrapping analysis or unless you are super fast with a stopwatch. The reliability of the SSlinp function can be worth the trade-off.↩︎

  6. I’m not a function wizard, but I’m trying, so forgive any poor coding practices.↩︎

  7. Hint: filter()↩︎

Built with Hugo
Theme Stack designed by Jimmy