Tuesday, September 12, 2017

Regression with categorical variables: why have intercepts?

Or, "I understand Degrees of Freedom a little more now."
(This is kind of basic, so if you're good at regression, please bear with me. Also, apologies for trying to use Blogger to display data - yes, I should probably use a table or, like, a Jupyter notebook, but... well, bear with me again.)

Ok, first imagine you have 1 categorical variable that predicts something, like your score on some test. Say the categorical variable is "do you like Star Wars or Star Trek better", and your data looks like this:

  • SW, score 3
  • ST, score 4
  • ST, score 4
  • SW, score 3

(I mean, I grew up on Star Wars myself. But it's hard to argue that Star Wars is the smarter franchise :P)
You can do this regression a couple ways:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • Regression equation: Score = 3*x1 + 4*x2

or

  • x0 = 1 always (call this the "intercept")
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • Regression equation: Score = 4*x0 + (-1)*x1
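
If you want to check this yourself, here's a quick numpy sketch of both codings (the row order and column layout are just my own made-up encoding of the four data points above):

import numpy as np

scores = np.array([3, 4, 4, 3])  # SW, ST, ST, SW

# Coding 1: one dummy column per level, no intercept.
X1 = np.array([[1, 0],   # SW
               [0, 1],   # ST
               [0, 1],   # ST
               [1, 0]])  # SW
print(np.linalg.lstsq(X1, scores, rcond=None)[0])  # [3. 4.]

# Coding 2: an intercept column plus a single Star Wars dummy.
X2 = np.array([[1, 1],   # SW
               [1, 0],   # ST
               [1, 0],   # ST
               [1, 1]])  # SW
print(np.linalg.lstsq(X2, scores, rcond=None)[0])  # [ 4. -1.]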

You cannot do this:

  • x0 = 1 always
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise

Because now you run into "nonidentifiability": your predictors are collinear with the intercept (x1 + x2 equals 1 for every data point, which is exactly x0). I mean, try it - try to fit a regression equation. Should it be this:
Score = 3*x0 + 0*x1 + 1*x2
or
Score = 0*x0 + 3*x1 + 4*x2
or
Score = 1000*x0 + (-997)*x1 + (-996)*x2
? All of these fit the data perfectly well. You've got too many predictors. Another way of saying this is: you've given the model more degrees of freedom than the data can actually pin down.
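
You can see the problem in numpy: stack up the intercept plus both dummies, and the design matrix isn't full rank, because x1 + x2 equals x0 exactly. (A sketch with my made-up column order; note that lstsq won't even complain - it'll quietly hand back one of the infinitely many solutions, which is part of what makes this trap sneaky.)

import numpy as np

# Columns: x0 (intercept), x1 (SW), x2 (ST).
X = np.array([[1, 1, 0],   # SW
              [1, 0, 1],   # ST
              [1, 0, 1],   # ST
              [1, 1, 0]])  # SW
print(np.linalg.matrix_rank(X))  # 2 -- three columns, only two independent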

But the question still remains, which two variables do you use? Like, the first way (with x1 and x2) seems really appealing, because you straight-up get the answer of how important each predictor is. But if you get used to using all levels of your categorical variables and having no intercepts, well, you may fall into a trap! Let's see how...

Now imagine you have 2 categorical variables that predict something, like, I dunno, score on some test. Say the variables are "do you like Star Wars or Star Trek better" and "favorite kind of small fish" (sardine, mackerel, sprat, or herring; there are only 4 fish in this world). And imagine our data looks like this:

  • SW, mackerel, score 5
  • ST, mackerel, score 6
  • ST, herring, score 8
  • ST, mackerel, score 6
  • SW, sprat, score 6
  • SW, sardine, score 4

You might be tempted to code them like this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise
  • x6 = 1 if they like herring best, 0 otherwise

But then you get the same problem. Is it:
score = 3*x1 + 4*x2 + 1*x3 + 2*x4 + 3*x5 + 4*x6
or
score = 4*x1 + 5*x2 + 0*x3 + 1*x4 + 2*x5 + 3*x6
or
score = 1004*x1 + 1005*x2 + (-1000)*x3 + (-999)*x4 + (-998)*x5 + (-997)*x6
? Again, these all fit the data perfectly.
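
Same deal if you check the rank in numpy: the six dummy columns only give you five independent directions, because x1 + x2 and x3 + x4 + x5 + x6 are each a column of all ones (again, the column order is just my made-up encoding):

import numpy as np

# Columns: x1 (SW), x2 (ST), x3 (sardine), x4 (mackerel), x5 (sprat), x6 (herring).
X = np.array([[1, 0, 0, 1, 0, 0],   # SW, mackerel
              [0, 1, 0, 1, 0, 0],   # ST, mackerel
              [0, 1, 0, 0, 0, 1],   # ST, herring
              [0, 1, 0, 1, 0, 0],   # ST, mackerel
              [1, 0, 0, 0, 1, 0],   # SW, sprat
              [1, 0, 1, 0, 0, 0]])  # SW, sardine
print(np.linalg.matrix_rank(X))  # 5 -- six columns, only five independent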

Instead, you can do this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise

And then you just know that, if x1 = 0, you've got a Star Trek fan, and if x3, x4, and x5 are all 0, you've got a herring eater.

But it doesn't quite fit the data. You can kinda tell that Star Trek gives you a 1-point boost over Star Wars, and you kinda know that herring > sprats > mackerel > sardines, but you can't model the fact that you just always have a baseline score. Or rather, imagine a Star Trek-loving herring eater; all the variables would be 0, so you have to predict their score is 0. Obviously that is not the case.

But to solve this, we just have to throw in an intercept:
x0 = 1 always.

Then there is exactly one regression equation that fits the data perfectly (or, in general, minimizes the sum of squared error):
Score = 8*x0 + (-1)*x1 + (-3)*x3 + (-2)*x4 + (-1)*x5
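
If you'd rather not build the dummy columns by hand, pandas' get_dummies with drop_first=True does this baseline trick for you (it drops the first level alphabetically, which here happens to be ST and herring, matching the baselines above), and least squares recovers the same equation:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fandom": ["SW", "ST", "ST", "ST", "SW", "SW"],
    "fish": ["mackerel", "mackerel", "herring", "mackerel", "sprat", "sardine"],
    "score": [5, 6, 8, 6, 6, 4],
})

# One dummy per level except the baseline, plus an intercept column of ones.
X = pd.get_dummies(df[["fandom", "fish"]], drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)

coef = np.linalg.lstsq(X.values, df["score"].values, rcond=None)[0]
print(dict(zip(X.columns, coef.round(2))))
# {'intercept': 8.0, 'fandom_SW': -1.0, 'fish_mackerel': -2.0,
#  'fish_sardine': -3.0, 'fish_sprat': -1.0}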

To put it another way: a categorical variable with N levels only gives you N-1 degrees of freedom, so it should only add N-1 terms to your equation. And you always start out with one DF for the intercept. In our first example (with just Star Wars/Trek), we had one variable with 2 levels, so the number of DF we get is 1 (intercept) + 1 (the 2-level variable) = 2, and the regression equation should have 2 terms. In the second example, we had one variable with 2 levels and one with 4, so we should have 1 + (2-1) + (4-1) = 5 terms. And the trick where one level of each variable becomes the "baseline" is the standard way to do that.
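
This is also what formula interfaces do behind the scenes. For example, statsmodels' formula API applies this treatment coding automatically: one reference level per categorical variable, intercept included. (The parameter labels below are my recollection of how patsy names them; the numbers are the same equation as above.)

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "fandom": ["SW", "ST", "ST", "ST", "SW", "SW"],
    "fish": ["mackerel", "mackerel", "herring", "mackerel", "sprat", "sardine"],
    "score": [5, 6, 8, 6, 6, 4],
})

# ST and herring become the baselines; the intercept is added for free.
fit = smf.ols("score ~ fandom + fish", data=df).fit()
print(fit.params)
# Intercept           8.0
# fandom[T.SW]       -1.0
# fish[T.mackerel]   -2.0
# fish[T.sardine]    -3.0
# fish[T.sprat]      -1.0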

So, if you only have one categorical variable, you can drop the intercept and use all the levels; that's OK. But that falls apart pretty quickly as your regression gets bigger, and then you've got to do it the by-the-book way: pick one level of each variable as the reference, and add an intercept.