Back
Featured image of post Fit linear-plateau models in R

Fit linear-plateau models in R

Agricultural yields go up a line, then level off.

The rcompanion package is the main reason I figured out how to fit non-linear models to agricultural data. Go reference it here. I’ll wait.

This post provides minimal explanations, and is intended for the person who has been googling how to fit linear-plateau models in R. This post also provides the function within another function that sort of helps one do linear-plateau fits and make a ggplot style figure all at once. So first, load some packages.

library(tidyverse)
library(rcompanion)
library(minpack.lm)
library(nlstools)
library(agridat)

The data

The data is from some cotton trials in the agridat package. Basically, more potassium measured in the soil should correlate to more cotton yield, but the relationship will not be linear. Fitting a model to determine something about this relationship is central to the soil-test correlation process. Let’s plot it.

cotton <- agridat::cate.potassium

cotton %>% 
  ggplot(aes(x = potassium, y = yield)) +
  geom_point(size = 1, alpha = 0.5) +
  geom_rug(alpha = 0.5, length = unit(2, "pt")) +
  scale_x_continuous(breaks = seq(0, 1000, 20)) +
  scale_y_continuous(breaks = seq(0, 200, 5)) +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")

The relationship between soil-test potassium (STK) and relative cotton yield may be non-linear. Perhaps a polynomial function could be fit, or the data could be transformed, but we’ll fit a non-linear model known as the linear-plateau.

## Linear-plateau model
lp <- function(x, a, b, c) {
    if_else(condition = x < c,
            true = a + b * x,
            false = a + b * c)
}

This function says that up until some join-point, the relationship is linear, after which the stick breaks and it is a flat line (plateau). Sometimes this model is called broken-stick or linear-plus-plateau. That join-point is important; it represents a critical concentration for the nutrient in the soil.

Below is my big function to fit it, get some useful numbers, and plot it. I’m not a function wizard, but I’m trying, so forgive any poor coding practices. That’s why in my function, I’ve just put in the variables yield and potassium,

What is going on in the following ‘function’?

  1. Get initial values for the nlsLM() function coming up.
  • this can be done by fitting a linear model, and yanking out coefficients.
  1. Run nlsLM()1 with the lp() model as part of the formula.
  2. Look at your model summary.
  3. Make a null model, and get a pseudo-R2 value.
  4. Get confidence interval for the join-point.
  5. Look at residuals.
  6. Get coefficients from model for use in ggplot.
  7. Make a line from coefficients, and plot it with the cotton yield data.
# run model and graph all in one!
lin_plateau <- function(data) {
    # initial values
    ini_fit <- lm(data = data, formula = yield ~ potassium)
    ini_a <- ini_fit$coef[[1]]
    ini_b <- ini_fit$coef[[2]]
    ini_c <- mean(data$potassium)
    
    # linear plateau model
    lp_model <<- nlsLM(
        formula = yield ~ lp(potassium, a, b, c),
        data = data,
        start = list(a = ini_a, b = ini_c, c = ini_c)
    )

    print(summary(lp_model))
    
    # Define null model
    m.ini <- mean(data$yield)
    
    nullfunct <- function(x, m) {
    m
}
    null <- nls(yield ~ nullfunct(potassium, m),
                data = data,
                start = list(m = m.ini),
                trace = FALSE,
                nls.control(maxiter = 1000))
    
    # Find p-value and pseudo R-squared
    pseudo_r2 <- nagelkerke(lp_model, null)[2]
    print(pseudo_r2)
    
    # print and set confidence interval
    confinterval <- confint2(lp_model, level = 0.95)
    conflo <- round(confinterval[3, 1], 1)
    confhi <- round(confinterval[3, 2], 1)
    print("Confidence Interval")
    print(confinterval)
    
    plotNormalHistogram(residuals(lp_model))
    plot(fitted(lp_model), residuals(lp_model))
    
    a  <- coef(lp_model)[[1]]
    b  <- coef(lp_model)[[2]]
    cc <- coef(lp_model)[[3]]
    plateau <- round(a + b * cc, 1)
    
    print(paste("Plateau at ", plateau, "% yield"))
    
    # other plot
    data %>%
      ggplot(aes(x = potassium, y = yield)) +
      geom_point(size = 1, alpha = 0.5) +
      geom_rug(alpha = 0.5, length = unit(2, "pt")) +
      geom_vline(xintercept = cc,
                 color = "#CC0000") +
      geom_vline(
        xintercept = c(conflo, confhi),
        color = "#CC0000",
        alpha = 0.2
      ) +
      geom_smooth(method = "nlsLM",
                formula = y ~ lp(x, a, b, c),
                method.args = list(start = as.list(coef(lp_model))),
                se = FALSE,
                color = "#4156a1",
                size = 0.5) +
      scale_x_continuous(breaks = seq(0, 1000, 20)) +
      scale_y_continuous(breaks = seq(0, 200, 5)) +
      labs(subtitle = paste0("Critical concentraion = ", round(cc, 0),
        " ppm; 95% CL [", conflo, ", ", confhi, "]"))
}

The result

So how do you fit a linear-plateau model in R? Like this:

lin_plateau(cotton) +
  labs(title = paste(
    "Relative yield and soil potassium for", nrow(cotton), "cotton trials"),
    y = "Relative yield (%)",
    x = "Concentration of Soil Potassium (ppm)")
## 
## Formula: yield ~ lp(potassium, a, b, c)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  39.3394     9.4833   4.148 0.000456 ***
## b   0.7489     0.2077   3.606 0.001659 ** 
## c  75.0000    10.7277   6.991 6.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.06 on 21 degrees of freedom
## 
## Number of iterations to convergence: 13 
## Achieved convergence tolerance: 1.49e-08
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                             0.126276
## Cox and Snell (ML)                   0.662275
## Nagelkerke (Cragg and Uhler)         0.662397
## 
## [1] "Confidence Interval"
##        2.5 %    97.5 %
## a 19.6177161 59.061050
## b  0.3170364  1.180782
## c 52.6905311 97.309472

## [1] "Plateau at  95.5 % yield"

ggsave("static/linear-plateau.png",
       dpi = 300, width = 7, height = 5)

The critical soil concentration for potassium appears to be around 75 ppm, though the confidence interval is pretty wide. This is probably due to the lack of data points. The model summary shows that most of the error comes from the the a and c parameters, essentially the intercept and join-point. The psuedo-R2 is about 0.66, so not too bad. Still, the distribution of points has gaps in it, which limits the confidence of the correlation.

Now, what if the two points with “high” STK (> 180 ppm), were removed? How would that change the analysis? Go back and test yourself2. How do other models compare? Quadratic-plateau? Mitscherlich-Bray?


  1. nlsLM() is more reliabel overall than plain nls(). I’ve had pretty good luck with nls(), but sometimes it fails. nlsLM() seems to fail less. You can test their similarity by using a, b, c coefficients in the output of nlsLM() as inputs to nls().↩︎

  2. Hint: filter()↩︎

Built with Hugo
Theme Stack designed by Jimmy