Restricted cubic splines

A flexible method for fitting regression lines

Peter Flom, Apr 8

A spline is a drafting tool for drawing curves.

In statistics, splines are a broad class of methods for transforming variables.

I first introduce the concept via linear splines and work my way to restricted cubic splines, which are what I (and many others) recommend.

You should be aware that there is a huge variety of splines and each has its proponents.

The pathway is as follows:

1. Dummy variables
2. Unrestricted linear splines
3. Restricted linear splines
4. Restricted cubic splines

In an earlier article I showed that categorizing (which is the dummy variable method) isn’t a good method.

There are two basic problems with it: The relationship is flat within each segment and it jumps between segments.

In spline terminology, letting a curve jump is called “unrestricted”.

Step 2 gets rid of the flatness (but leaves us with straight lines that jump).

Step 3 gets rid of the jumps (but keeps the straight lines).

Step 4 lets the relationship in each section curve.

The result is a very flexible curve that has no jumps.
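To make the "no jumps" idea of step 3 concrete, here is a minimal sketch in base R that builds a restricted linear spline by hand. The simulated data, the knot positions and the helper name hinge are assumptions for illustration only; they are not taken from the data used in this article.

```r
# Simulated data; the sine shape and noise level are assumed for illustration.
set.seed(1)
x <- sort(runif(200, 0, 2 * pi))
y <- sin(x) + rnorm(200, sd = 0.3)

# Illustrative knots at the quartiles of x.
knots <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)

# One "hinge" column per knot: zero to the left of the knot, linear to the right.
hinge <- function(x, k) pmax(x - k, 0)
basis <- sapply(knots, function(k) hinge(x, k))
colnames(basis) <- paste0("h", seq_along(knots))

# The slope may change at each knot, but the line itself stays connected.
fit_lin <- lm(y ~ x + basis)

# Evaluate the fitted line at arbitrary points from the coefficients.
pred_at <- function(x0) {
  design <- cbind(1, x0, sapply(knots, function(k) hinge(x0, k)))
  drop(design %*% coef(fit_lin))
}

# Predictions just left and just right of each knot agree: no jumps.
eps <- 1e-8
gap <- abs(pred_at(knots + eps) - pred_at(knots - eps))
print(gap)
```

Each hinge term is continuous, so any weighted sum of them is continuous too. That is what "restricted" buys in step 3; step 4 applies the same idea with cubic pieces instead of straight ones.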

Restricted cubic splines (RCS) have many advantages but they have one big disadvantage: The resultant output is not always easy to interpret.

Two aspects of splines that we have not touched on are the number of knots to allow and how to place them.

Various proposals have been made, but Frank Harrell recommends using 4 knots if N < 100 and 5 for larger data sets, placing them at the 5th, 35th, 65th and 95th percentiles for k = 4 and at the 5th, 27.5th, 50th, 72.5th and 95th percentiles for k = 5 (where k is the number of knots).
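The rule above is easy to wrap in a small helper. This is only a sketch: the function name harrell_knots is hypothetical and is not part of Hmisc or rms.

```r
# Hypothetical helper: knot locations following the rule of thumb above.
harrell_knots <- function(x) {
  if (length(x) < 100) {
    probs <- c(0.05, 0.35, 0.65, 0.95)           # k = 4
  } else {
    probs <- c(0.05, 0.275, 0.50, 0.725, 0.95)   # k = 5
  }
  quantile(x, probs, names = FALSE)
}

# Usage: on the integers 0..100 the knots fall on the percentiles themselves.
k5 <- harrell_knots(0:100)
print(k5)
```

The returned vector can be passed wherever a function accepts explicit knot locations.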

All this may be clearer by example.

I will start with a sine regression (as in my previous post). For comparison purposes, we should follow the above advice about knots for the dummy variable plot.
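The objects x, y and x6int used in the code in this article come from the earlier post and are not defined here. A minimal stand-in setup might look like the sketch below. It assumes a noisy sine curve (the sample size, noise level and seed are illustrative guesses, not the original values) and six intervals cut at the five recommended percentiles. The model name mDummy is likewise made up for this sketch.

```r
# Stand-in data: the original x and y are not shown, so these values are assumed.
set.seed(123)
n <- 500
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.25)

# Five knots at the percentiles recommended for larger data sets, plus the
# minimum and maximum so that cut() assigns every observation to an interval.
breaks <- quantile(x, c(0, 0.05, 0.275, 0.5, 0.725, 0.95, 1))
x6int  <- as.integer(cut(x, breaks, include.lowest = TRUE))

# Dummy variable model: one fitted mean per interval.
mDummy <- lm(y ~ factor(x6int))

plot(x, y, pch = 20, cex = 0.5, las = 1, xlab = "x", ylab = "y")
points(x, fitted(mDummy), col = "red")

# Flat within each interval: the spread of fitted values per interval is zero.
flat_within <- tapply(fitted(mDummy), x6int, function(v) diff(range(v)))
print(flat_within)
```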

This yields:

[Figure: dummy variable fit]

The first step is to allow the line between each pair of knots to have a nonzero slope.

Somewhat messy code (dammit Jim, I’m a data analyst, not a programmer!) is:

    # Fit a separate straight line within each of the six intervals
    mUnresLin1 <- lm(y ~ x, subset = (x6int == 1))
    mUnresLin2 <- lm(y ~ x, subset = (x6int == 2))
    mUnresLin3 <- lm(y ~ x, subset = (x6int == 3))
    mUnresLin4 <- lm(y ~ x, subset = (x6int == 4))
    mUnresLin5 <- lm(y ~ x, subset = (x6int == 5))
    mUnresLin6 <- lm(y ~ x, subset = (x6int == 6))

    # Plot the fitted values interval by interval, then overlay the data
    plot(x[x6int == 1], mUnresLin1$fitted.values, las = 1,
         xlab = "x", ylab = "y", col = "red",
         xlim = c(min(x), max(x)), ylim = c(min(y), max(y)))
    points(x[x6int == 2], mUnresLin2$fitted.values, col = "red")
    points(x[x6int == 3], mUnresLin3$fitted.values, col = "red")
    points(x[x6int == 4], mUnresLin4$fitted.values, col = "red")
    points(x[x6int == 5], mUnresLin5$fitted.values, col = "red")
    points(x[x6int == 6], mUnresLin6$fitted.values, col = "red")
    points(x, y, pch = 20, cex = .5)

This yields:

[Figure: unrestricted linear spline fit]

This is already a major improvement, but it has jumps (one of them fairly large) and sudden shifts in direction that are probably as hard to justify as the jumps in the earlier model.

Next, we can force the lines to match up with a restricted linear spline.

There is already an R function for this, so the code is straightforward:

    install.packages("lspline")
    library(lspline)

    # Restricted (connected) linear spline with knots at the chosen quantiles of x
    mlinspline <- lm(y ~ lspline(x, quantile(x, c(0, .05, .275, .5, .725, .95, 1))))

    plot(x, mlinspline$fitted.values, las = 1,
         xlab = "x", ylab = "y", col = "red",
         xlim = c(min(x), max(x)), ylim = c(min(y), max(y)))
    points(x, y, pch = 20, cex = .5)

This produces:

[Figure: restricted linear spline fit]

The final step is to allow the lines within each segment to curve.

We can do this with restricted cubic splines; again, there is an R package making this easy.

    library(Hmisc)
    library(rms)

    # Restricted cubic spline on the same knots, fit by ordinary least squares
    mRCS <- ols(y ~ rcs(x, quantile(x, c(0, .05, .275, .5, .725, .95, 1))))

    plot(x, mRCS$fitted.values, col = "red",
         xlim = c(min(x), max(x)), ylim = c(min(y), max(y)))
    points(x, y)

This produces:

[Figure: restricted cubic spline fit]

The fit matches the original sine curve very well.
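When the true curve is known, as it is with simulated data, "matches very well" can be checked numerically. The sketch below is self-contained and makes several assumptions. The data are simulated (not the data plotted above). It uses five knots at the recommended percentiles. It also swaps in ns() from the splines package (which ships with R) in place of rcs(). A natural cubic spline spans the same family of curves, cubic between the knots and linear beyond the outermost ones, so the fitted values should agree even though the coefficients are parameterized differently.

```r
library(splines)  # ships with R; ns() builds a natural cubic spline basis

# Simulated data; sample size, noise level and seed are assumed.
set.seed(42)
n <- 500
x <- runif(n, 0, 2 * pi)
y <- sin(x) + rnorm(n, sd = 0.25)

# Five knots at the recommended percentiles.
k <- quantile(x, c(0.05, 0.275, 0.5, 0.725, 0.95), names = FALSE)

# Step 1: dummy variables (flat steps between knots).
seg   <- cut(x, c(-Inf, k, Inf))
fit_d <- lm(y ~ seg)

# Step 3: restricted linear spline via hinge terms pmax(x - knot, 0).
H     <- sapply(k, function(kn) pmax(x - kn, 0))
fit_l <- lm(y ~ x + H)

# Step 4: natural (restricted) cubic spline; the outermost knots are the
# boundary knots, the remaining three are interior knots.
fit_c <- lm(y ~ ns(x, knots = k[2:4], Boundary.knots = k[c(1, 5)]))

# Root mean squared distance between each fitted curve and the true sine.
rmse <- sapply(list(dummy = fit_d, linear = fit_l, cubic = fit_c),
               function(m) sqrt(mean((fitted(m) - sin(x))^2)))
print(round(rmse, 3))
```

On data like these, the dummy variable fit should sit much further from the true curve than either spline fit. The exact numbers depend on the seed and noise level.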
