Back
Featured image of post Fit quadratic-plateau models in R

Fit quadratic-plateau models in R

Agricultural yields go up a curve, then level off.

How does one fit a quadratic-plateau model in R?

Like this:

quad_plateau(data)

Or at least that is how it works once I explain a few things. Like my post on linear-plateau models in R, this post is for the person1 who has been googling how to fit quadratic-plateau models in R. This post provides three ways to accomplish this, and also shows off my convoluted creative function that helps one do quadratic-plateau fits and make a ggplot style figure all at once.

So first, get out your library card and borrow some packages.

library(tidyverse) # for data manipulation and plotting and so on
library(nlraa) # for self-starting models
library(minpack.lm)# alternative for nls model making
library(broom)     # for getting model info in a tidy way
library(modelr)    # for the r-squared and rmse
library(nlstools)  # for Wald confidence intervals
library(agridat)   # for the data

The Data

The data is from some cotton trials in the agridat package. Basically, more potassium measured in the soil should correlate to more cotton yield, but likely the relationship will not be linear. Fitting a model to determine something about this relationship is central to the soil-test correlation process. Let’s plot it to see what we’re working with.

cotton <- tibble(x = agridat::cate.potassium$potassium,
                 y = agridat::cate.potassium$yield)

cotton %>% 
  ggplot(aes(x, y)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_rug(alpha = 0.5, length = unit(2, "pt")) +
  scale_x_continuous(breaks = seq(0, 1000, 20)) +
  scale_y_continuous(breaks = seq(0, 200, 10)) +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")

The relationship between soil-test potassium (STK) and relative cotton yield may be nonlinear2. Perhaps a polynomial function could be fit, or the data could be transformed, but we’ll fit a non-linear model known as the quadratic-plateau. Quadratic-plateau (QP) models are also called quad-plats3. Quad-plats are good in that they have a curved component (important to biological systems) followed by a join point and a zero-slope plateau (convenient for researchers). Notice how the relative yield of the cotton does not appear to increase anymore from about 70-200 ppm K (the plateau!).

The QP Function

The equation for the QP is essentially that of a quadratic up until the join point: y = B0 + B1x + Bx2. While we could define the QP equation in the nls() function later, let’s make a function now.

# b0 = intercept
# b1 = slope
# b2 = quadratic term (curvy bit)
# jp = join point = critical concentration

qp <- function(x, b0, b1, jp) {
    b2 <- -0.5 * b1 / jp
    if_else(
        condition = x < jp,
        true  = b0 + (b1 * x) + (b2 * x * x),
        false = b0 + (b1 * jp) + (b2 * jp * jp)
    )
}

This function says that up until some join-point, the relationship is a second-order polynomial (quadratic), after which it hits a flat line (plateau). This model is sometimes also called quadratic-plus-plateau. That join-point is important. In the context of soil fertility it represents a critical concentration for the nutrient in the soil. The join point of a quad-plat will always be higher than with a linear-plateau, an important consideration for another time and place.

The qp function will be used inside the base R nls() function inside my quad_plateau() function later. So look for them there.

Method 1: base R

Using the nls() function in base R requires that we provide the formula (in this case the QP function), the data, and starting values. Starting values are critical4, and the computer doesn’t really know where to start, so poor starting values may result in failures of the model to converge as the algorithms working their tails off behind the scenes try to home in on the best values. It helps, though, to know that 1) the slope will be positive, 2) the quadratic term will therefore be negative, and 3) the join point will certainly not be less than your minimum soil test value (and ideally not more than your maximum).

Get starting values by guessing

Eyeballing that scatter plot, I’m trying to visualize in my mind’s eye a quadratic line coming up and bending over at a join point. The join point is easiest to imagine, and I’d guess it’s somewhere between 60 and 80 ppm K. Going to the left, that imaginary quadratic line looks like it might intercept the y-axis around 0. The slope of that line would be positive, and looks like maybe it’s around 2 (2% RY gain for every 1 ppm increase).

nls(formula = y ~ qp(x, b0, b1, jp),
    data = cotton,
    start = list(b0 = 0,
                 b1 = 2,
                 jp = 70))
## Nonlinear regression model
##   model: y ~ qp(x, b0, b1, jp)
##    data: cotton
##     b0     b1     jp 
## 12.864  1.912 86.197 
##  residual sum-of-squares: 2463
## 
## Number of iterations to convergence: 6 
## Achieved convergence tolerance: 4.329e-06

And it worked! The response curve is 12.8 + 1.91x - ??x2 up to 86 ppm K. Wait what is the quadratic term? It’s not in the model, but it is in the QP formula which is B2 = -0.5*B1/JP, or -0.011.

Get starting values with augmented guessing

Another way to estimate starting values is to mess around with sliders on a graphing calculator like Desmos. If the equation is a + bx - cx2, for example, then you can move the sliders and their value ranges until you get something you could imagine would go through your points. So zoom out until the x-axis and y-axis ranges match your dataset, then move sliders, then use those values in the starting values list like before.

Get starting values by fitting a quadratic polynomial

If you’re a sane person, you might decide to get starting values for a QP model by fitting a regular ol’ quadratic polynomial5. From this the intercept and slope can be estimated more precisely, and the join point can then be estimated by taking the mean or median of the soil test values. In fact, we could wrap all of these steps into a function.

fit_qp <- function(data) {
    # get starting values for nls
    start <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
    # make model
    fit <- nlsLM(formula = y ~ qp(x, b0, b1, jp),
               data = data,
               start = list(b0 = start$coef[[1]],
                            b1 = start$coef[[2]],
                            jp = mean(data$x)))
    return(fit)
}

fit_qp(cotton)
## Nonlinear regression model
##   model: y ~ qp(x, b0, b1, jp)
##    data: data
##     b0     b1     jp 
## 12.865  1.912 86.198 
##  residual sum-of-squares: 2463
## 
## Number of iterations to convergence: 5 
## Achieved convergence tolerance: 1.49e-08

It worked again. Also, did you notice that I used nlsLM instead of nls in my function? More on that later.

Method 2: nlraa

Better than providing starting values to nls or nlsLM is to use a function that can “self-start”. The nlraa package provides just that, a self-starting function for a quadratic-plateau model. There are two functions available, but the one with 4 parameters seems to have issues, so let’s just use the SSquadp3 three parameter function.

nls(formula = y ~ SSquadp3(x, b0, b1, b2),
    data = cotton)
## Nonlinear regression model
##   model: y ~ SSquadp3(x, b0, b1, b2)
##    data: cotton
##       b0       b1       b2 
## 12.86429  1.91241 -0.01109 
##  residual sum-of-squares: 2463
## 
## Number of iterations to convergence: 8 
## Achieved convergence tolerance: 1.441e-06

Flawless. This function optimizes the starting values through some sort of searching 6. One difference though is that it doesn’t give the join-point. Instead, calculate as JP = -0.5 * B1 / B2. One downside with not getting a join-point in the model is that confidence intervals around it cannot be estimated as quickly.

Method 3: Convoluted way

Next is my big quad_plateau()7 that fits the QP model and either gets useful numbers into a table or plots it visually. Getting results into a table means that this function is compatible with tidyverse/tidymodels style work, usable on grouped data frames and so on. Getting results into a plot could be done with a separate function, but why not cram it all into one? Here are the steps of the function:

  1. Get starting values for the nlsLM() function by fitting a quadratic polynomial.

  2. Run nlsLM()8 with the fit_qp function (defined earlier).

  3. Test goodness-of-fit with R2, RMSE, and model error.

  4. Get model coefficients for response curve equation.

  5. Get a confidence interval for the join-point (critical soil test value).

  6. Optional: look at standard error and p-values for the coefficients, though these may not matter much.

  7. Optional: get the soil test value at some point below the join-point, such as 95% of the maximum yield plateau.

  8. Create a table of results.

    OR

  9. Plot the cotton yield data with a regression line using nlsLM and the model final values as starting values (which all ensure identical results).

quad_plateau <- function(data,
                         alpha = 0.05,
                         confidence = 95,
                         percent_of_max = 95,
                         plot = FALSE) {
    
    # uses the fit_qp function defined earlier
    try(corr_model <- fit_qp(data))
    
    # How did the model do overall?
    rsquared <- round(modelr::rsquare(corr_model, data), 2)
    rmse <- round(modelr::rmse(corr_model, data), 2)
    model_error <- round(summary(corr_model)$sigma, 2)
    
    # get model coefficients
    b0 <- tidy(corr_model)$estimate[1]
    b1 <- tidy(corr_model)$estimate[2]
    # join point of segmented regression = critical soil test value
    jp <- tidy(corr_model)$estimate[3]
    b2 <- -0.5 * b1 / jp
    
    # equation from QP function
    plateau <- round(b0 + (b1 * jp) + (b2 * jp * jp), 1)
    
    equation <- paste0(round(b0, 1), " + ",
                       round(b1, 2), "x + ",
                       round(b2, 3), "xx")
    
    # get wald confidence interval of join point at 100% 
    ci <- nlstools::confint2(corr_model, level = confidence / 100)
    lower_cl <- round(ci[3, 1], 0)
    upper_cl <- round(ci[3, 2], 0)
    
    # CSTV at defined % of max/plateau (use percent_of_max argument)
    adjust_cstv <- function(b0, b1, b2) {
        jp <- -0.5 * b1 / b2
        newplateau <- (b0 + (b1 * jp) + (b2 * jp * jp)) * percent_of_max / 100
        newb0 <- b0 - newplateau
        discriminant <- (b1 ^ 2) - (4 * newb0 * b2)
        
        if (discriminant < 0) {
            return(NA)
        }
        else if (discriminant > 0) {
            cstv_adj <- (-b1 + sqrt(discriminant)) / (2 * b2)
            
            return(cstv_adj)
        }
    }
    
    cstv_adj <- round(adjust_cstv(b0, b1, b2), 0)
    
    # Printouts
    if (plot == FALSE) {
        tibble(
            equation,
            plateau,
            join_point = round(jp, 0),
            lower_cl,
            upper_cl,
            rsquared,
            rmse,
            model_error,
            AIC = round(AIC(corr_model),0)
        )
    } else {
        # Residual plots and normality
        nls_resid <- nlstools::nlsResiduals(corr_model)
        plot(nls_resid, which = 0)
        predicted <- nlraa::predict_nls(corr_model,
                                 newdata = data,
                                 interval = "confidence") %>%
            as_tibble() %>% 
            bind_cols(data)
        
        # ggplot of data
        predicted %>%
            ggplot(aes(x, y)) +
            geom_ribbon(aes(y = Estimate,
                            ymin = Q2.5,
                            ymax = Q97.5),
                        alpha = 0.05) +
            ggpp::geom_vhlines(xintercept = jp,
                               yintercept = plateau,
                               alpha = 0.5,
                               linetype = 3) +
            ggpp::geom_vhlines(xintercept = cstv_adj,
                               yintercept = plateau * percent_of_max / 100,
                               alpha = 0.3,
                               linetype = 3) +
            geom_line(
                stat="smooth",
                method = "nlsLM",
                formula = y ~ qp(x, b0, b1, jp),
                method.args = list(start = as.list(coef(corr_model))),
                se = FALSE,
                color = "#CC0000"
            ) +
            geom_point(size = 2, alpha = 0.5) +
            geom_rug(alpha = 0.2, length = unit(2, "pt")) +
            annotate("text", 
                label = paste0("y = ", equation,
                              "\nR-squared = ", rsquared,
                              "\nRMSE = ", rmse,
                              "\nCSTV at ", round(plateau, 1), "% is ", round(jp, 0)," ppm (CI ", lower_cl, "-", upper_cl, ")",
                              "\nAdjusted CSTV at ", percent_of_max, "% of max = ", cstv_adj, " ppm"),
                x = max(data$x) * 0.6,
                y = quantile(data$y, 0.10),
                hjust = 0,
                family = rc
            ) +
            labs(x = "Soil test K (ppm)",
                 y = "Relative yield (%)",
                 caption = paste("Each point is a site. n =", nrow(data)))
    }
    
}

Convoluted result

So how do you fit a quadratic-plateau model in R? Like so:

quad_plateau(cotton)
## # A tibble: 1 x 9
##   equation plateau join_point lower_cl upper_cl rsquared  rmse model_error   AIC
##   <chr>      <dbl>      <dbl>    <dbl>    <dbl>    <dbl> <dbl>       <dbl> <dbl>
## 1 12.9 + ~    95.3         86       45      127     0.68  10.1        10.8   187

The model is written out in a data frame, ready for other use and manipulation. This is really helpful if for example you have a grouped data frame of different fertilizer trials, and want to perform a QP fit to each group using something like group_modify().

If you want to plot it along with an overlay of model results, use the plot = TRUE argument and add your labels. Use group_map() if you want to “loop” this over multiple groups in your data frame.

quad_plateau(cotton, plot = TRUE)

The critical soil concentration for potassium appears to be around 86 ppm, higher than the linear-plateau model. The confidence interval for the join-point is also wider than the LP model. In fact, the confidence interval is just wide (45-127!). This problematic width is partially due to getting the confidence limits at the join point, rather than at a lower point such as 99 or 95% of the maximum yield plateau. But in the case of this dataset, I’d say the problem has to do with data availability. There are gaps in the data, including a hole between 40-80 ppm STK and 65-90% RY. Basically the model lacks confidence in the fit approaching the join point, and this is reflected in the wide gray RY confidence band shown around the red line in the plot. The experimenter needs more data in that gap.

Now, what if the two points with “high” STK (> 180 ppm), were removed? How would that change the analysis? Go back and test yourself9. How do other models compare? Linear-plateau? Mitscherlich-Bray?

More resources about nonlinear functions in R

  • rcompanion package - gave me a good jumpstart and has some useful functions inside

  • nlstools package

  • nlraa package by Fernando Miguez

  • easynls package - it’s quick and dirty, but not as useful as it could be

  • nls.multstart package - tries lots of starting values to try and better get at the best values for the equation. But I think for QP models, my approach works fine and using multiple starting values is unnecessary.

Bonus: Quick figure with QP line

What if you just want a quick figure of a QP fitted to your scatter plot points, and do not care so much about the model results and information, etc? Using the self-starting function from nlraa, try the following:

cotton %>% 
  ggplot(aes(x, y)) +
  geom_point(size = 2,
             alpha = 0.5) +
  geom_line(stat="smooth",
            method = "nlsLM",
            formula = y ~ SSquadp3(x, b0, b1, jp),
            se = FALSE,
            color = "#CC0000") +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")

Congratulations! You are now certified to quad-plat in R.


  1. Probably a grad student↩︎

  2. No one will judge you if you hold a clear, plastic ruler up to your computer screen.↩︎

  3. This isn’t true, but I wish it was.↩︎

  4. What if the starting values result in a failure of converge? Check out these tips from Dr. Miguez.↩︎

  5. Also known as a quadrapoly.↩︎

  6. The trade-off is that it takes a little longer to run than my fit_qp function. You probably won’t notice a difference unless you are running it 1000 times in a bootstrapping analysis or unless you are super fast with a stopwatch. The reliability of the SSquadp3 function can be worth the trade-off.↩︎

  7. I’m not a function wizard, but I’m trying, so forgive any poor coding practices.↩︎

  8. nlsLM() seems more reliable overall than plain nls() from base R. I’ve had decent luck with nls(), but sometimes it fails. nlsLM() seems to fail less, probably due to using a different algorithm. You can test their similarity in reaching the same final results by using the output of nlsLM() as starting values in nls().↩︎

  9. Hint: filter()↩︎

Built with Hugo
Theme Stack designed by Jimmy