Tuesday, September 12, 2017

Regression with categorical variables: why have intercepts?

Or, "I understand Degrees of Freedom a little more now."
(This is kind of basic, so if you're good at regression, please bear with me. Also, apologies for trying to use Blogger to display data, yes I should probably use a table or like a Jupyter notebook, but... well, bear with me again.)

Ok, first imagine you have 1 categorical variable that predicts something, like your score on some test. Say the categorical variable is "do you like Star Wars or Star Trek better", and your data looks like this:

  • SW, score 3
  • ST, score 4
  • ST, score 4
  • SW, score 3

(I mean, I grew up on Star Wars myself. But it's hard to argue that it's a smarter show :P)
You can do this regression a couple ways:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • Regression equation: Score = 3*x1 + 4*x2

or

  • x0 = 1 always (call this the "intercept")
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • Regression equation: Score = 4*x0 + (-1)*x1
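If you want to see that concretely, here's a minimal sketch using numpy's least squares (the variable names are just mine); both encodings fit this data exactly:

import numpy as np

scores = np.array([3, 4, 4, 3])  # SW, ST, ST, SW

# Way 1: one indicator column per level (x1 = Star Wars, x2 = Star Trek).
X_levels = np.array([
    [1, 0],  # SW
    [0, 1],  # ST
    [0, 1],  # ST
    [1, 0],  # SW
])
print(np.linalg.lstsq(X_levels, scores, rcond=None)[0])  # [3. 4.]

# Way 2: an always-1 intercept column x0, plus just the Star Wars indicator x1.
X_intercept = np.array([
    [1, 1],  # SW
    [1, 0],  # ST
    [1, 0],  # ST
    [1, 1],  # SW
])
print(np.linalg.lstsq(X_intercept, scores, rcond=None)[0])  # [ 4. -1.]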

You cannot do this:

  • x0 = 1 always
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise

Because then you run into "nonidentifiability": your predictors are collinear with the intercept (x1 + x2 = x0 in every row). I mean, try it - try to fit a regression equation. Should it be this:
Score = 3*x0 + 0*x1 + 1*x2
or
Score = 0*x0 + 3*x1 + 4*x2
or
Score = 1000*x0 + (-997)*x1 + (-996)*x2
? All these fit the data perfectly well. You've got too many predictors. Another way of saying this is, you've got too many degrees of freedom.
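(You can watch the nonidentifiability happen, too. A quick numpy sketch: with all three columns, the design matrix is rank-deficient, and wildly different coefficient vectors all reproduce the data exactly.)

import numpy as np

scores = np.array([3, 4, 4, 3])
# Columns: x0 (always 1), x1 (SW), x2 (ST).
X = np.array([
    [1, 1, 0],  # SW
    [1, 0, 1],  # ST
    [1, 0, 1],  # ST
    [1, 1, 0],  # SW
])

# Three columns, but x1 + x2 = x0 in every row, so the rank is only 2.
print(np.linalg.matrix_rank(X))  # 2

# All three of the "solutions" above predict the data perfectly:
for beta in ([3, 0, 1], [0, 3, 4], [1000, -997, -996]):
    print(X @ np.array(beta))  # [3 4 4 3] every time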

But the question still remains: which two variables do you use? Like, the first way (with x1 and x2) seems really appealing, because you straight-up get the answer of how important each predictor is. But if you get used to using all levels of your categorical variables and having no intercept, well, you may fall into a trap! Let's see how...

Now imagine you have 2 categorical variables that predict something, like, I dunno, score on some test. Say the variables are "do you like Star Wars or Star Trek better" and your favorite kind of small fish (sardine, mackerel, sprat, or herring; there are only 4 fish in this world). And imagine our data looks like this:

  • SW, mackerel, score 5
  • ST, mackerel, score 6
  • ST, herring, score 8
  • ST, mackerel, score 6
  • SW, sprat, score 6
  • SW, sardine, score 4

You might be tempted to code them like this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise
  • x6 = 1 if they like herring best, 0 otherwise

But then you get the same problem. Is it:
score = 3*x1 + 4*x2 + 1*x3 + 2*x4 + 3*x5 + 4*x6
or
score = 4*x1 + 5*x2 + 0*x3 + 1*x4 + 2*x5 + 3*x6
or
score = 1004*x1 + 1005*x2 + (-1000)*x3 + (-999)*x4 + (-998)*x5 + (-997)*x6
? Again, these all fit the data perfectly.

Instead, you can do this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise

And then you just know that, if x1=0, you've got a Star Trek fan, and if x3, x4, and x5 are all 0, you've got a herring eater.

But it doesn't quite fit the data. You can kinda tell that Star Trek gives you a 1-point boost over Star Wars, and you kinda know that herring > sprats > mackerel > sardines, but you can't model the fact that you always have a baseline score. Or rather, imagine a Star Trek-loving herring eater: all the variables would be 0, so you'd have to predict their score is 0, when their actual score is 8.

But to solve this, we just have to throw in an intercept:
x0 = 1 always.

Then we can fit exactly one regression equation that fits the data perfectly (or, in general, the one that minimizes the sum of squared errors):
Score = 8*x0 + (-1)*x1 + (-3)*x3 + (-2)*x4 + (-1)*x5
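(Quick check with numpy, if you like: the design matrix now has full column rank, so least squares has exactly one answer.)

import numpy as np

scores = np.array([5, 6, 8, 6, 6, 4])
# Columns: x0 (always 1), x1 (SW), x3 (sardine), x4 (mackerel), x5 (sprat).
X = np.array([
    [1, 1, 0, 1, 0],  # SW, mackerel
    [1, 0, 0, 1, 0],  # ST, mackerel
    [1, 0, 0, 0, 0],  # ST, herring: the all-baseline row
    [1, 0, 0, 1, 0],  # ST, mackerel
    [1, 1, 0, 0, 1],  # SW, sprat
    [1, 1, 1, 0, 0],  # SW, sardine
])
print(np.linalg.matrix_rank(X))                   # 5, i.e. full column rank
print(np.linalg.lstsq(X, scores, rcond=None)[0])  # [ 8. -1. -3. -2. -1.]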

To put it another way: each categorical variable has N levels, but it only gives us N-1 degrees of freedom, so it should only add N-1 terms to our equation. And you always start out with one DF (for the intercept). In our first example (with just Star Wars/Trek), we had one variable with 2 levels, so the number of DF we get is 1 (intercept) + (2-1) = 2, and our regression equation should have 2 terms. In the second example, we had one variable with 2 levels and one with 4, so we should have 1 + (2-1) + (4-1) = 5 terms. And doing the trick where one level of each variable is the "baseline" is really the only way to do that.
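If you use pandas, drop_first=True on get_dummies does this baseline trick for you. A sketch (note: get_dummies drops whichever level comes first alphabetically, which here happens to be ST and herring, the same baselines we used above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "franchise": ["SW", "ST", "ST", "ST", "SW", "SW"],
    "fish": ["mackerel", "mackerel", "herring", "mackerel", "sprat", "sardine"],
    "score": [5, 6, 8, 6, 6, 4],
})

# drop_first=True keeps N-1 dummy columns per variable: 1 + 3 = 4 here.
dummies = pd.get_dummies(df[["franchise", "fish"]], drop_first=True).astype(float)
print(list(dummies.columns))
# ['franchise_SW', 'fish_mackerel', 'fish_sardine', 'fish_sprat']

# Stick the intercept on for the 5th and final degree of freedom.
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
print(np.linalg.lstsq(X, df["score"].to_numpy(), rcond=None)[0])
# [ 8. -1. -2. -3. -1.]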

So, if you only have one categorical variable, you can skip the intercept and use all the levels; that's ok. But that falls apart pretty quickly as your regression gets bigger, and then you've got to do it the by-the-book way: use one level of each variable as the reference, and add an intercept.

Friday, May 19, 2017

State of the Geotags: Motivations and Recent Changes

... is the title of our recent paper at ICWSM 2017, "our" being me, Zichen Liu, Alex Sciuto, and our kind advisor Jason Hong.

The paper, in one sentence: think of geotagged social media posts (tweets, instagrams, etc) as postcards, not as ticket stubs; as conscious choices, not unconscious byproducts.

In three bullet points:
- when people are posting something to Twitter/Flickr/etc, they usually consciously choose to add their location; it's not a "set it and forget it" situation. (we found this out by analyzing how often people toggle geotagging on and off.)
- when people post their location, they usually do it from unusual or faraway places. They usually don't do it at home or in their neighborhood, and they usually don't do it from places that they go to regularly. (this is from surveys.)
- people are posting their specific location less than they used to. Some of this might be privacy; a lot of it is because Twitter changed the defaults on how your location is posted.

In more detail:
Paper Slides

Friday, March 31, 2017

Getting from Zero to What I Do Most Of The Time With Data

We've been getting a lot of undergrads and master's students coming on board in our lab, with vastly different levels of experience. That is good! The more diversity, the better, I say.

However, it tends to be hardest for those with the least experience. Often I'll say something like "just ssh in to the server, connect to our postgres database, and get all the tweets in this area." And they'll be like "oops I guess I was supposed to know what ssh and postgres are, but I don't, so now I'm either trying to bluff or googling frantically." Which is too bad! I think what they should do here is ask me for advice, but they don't know that. They might think that I'm either a jerk who would make fun of them for not knowing that much, or that I'm a person with wayyy too many responsibilities to possibly give them the time they need (i.e. a professor).

I do appreciate their concern for my time, though, and it's probably more fun for them to learn things themselves (go at your own pace, etc), so I've put together this list of useful guides.

Unix Computing Basics
How to Unix (mac) - work through Conquering the Command Line (chapter 1)
To get started with this, open up the program "Terminal" on your mac. You can do that by going to the magnifying glass in the top right and typing "terminal."
How to get a Unix-ish prompt on Windows - I don't actually know. Someone suggest a tutorial for this.
How to Unix (Linux) - despite this being the year of the Linux desktop, few people have one. If you do, open up a terminal however you do. I used to run Ubuntu and it made that pretty easy.
SSH (to connect to a remote server and navigate around there) - the "basic syntax" part is fine. You probably won't need the "keys" bit but it might be fun if you want to look around later.
SFTP (if you want to download a file from a remote server)
git: try Software Carpentry's git novice course. (parts 1-9 especially.) All of Software Carpentry's stuff seems pretty good.

Vim and other Text Editing
You should probably know at least the basics of Vim, because it's installed on every computer ever, and you always need to edit text files. Also, sometimes you'll end up in vim for some reason and it's good to be able to quit. Learn Enough's text editor class (at least chapter 1, Vim) seems like a good place to start.

Python
Software Carpentry has a good lesson here too, specialized for research computing.
For more general python, or if you are starting from zero programming, you might have more luck with Learn Python The Hard Way (for everything; long, but you can breeze through the parts you already know.)
pip and virtualenv. Generally you should make a virtualenv for each project. Stuff may work without one, but then you've got global dependencies: if you need module A to be v3.0, but you update it to v4.0 for some other project, your old thing that's still expecting 3.0 may stop working. virtualenv gives you a separate copy of module A for each project that needs it. You might see some places recommending conda instead; it's fine too. I have less experience with it, but it'll probably get you where you need to go.
The csv module is particularly useful; here is a guide for it. As is the argparse module; tutorial here.
If you need to send out HTTP requests, use the requests module.
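Here's a tiny sketch tying those three together (the CSV filename, the "tweet_id" column, and the URL are all made up for the example; the pattern is the point):

# Toy script: read a CSV of tweet IDs and hit some API for each one.
import argparse
import csv

import requests

parser = argparse.ArgumentParser(description="Demo of argparse + csv + requests.")
parser.add_argument("csv_file", help="path to a CSV with a 'tweet_id' column")
parser.add_argument("--url", default="https://example.com/api", help="API base URL")
args = parser.parse_args()

with open(args.csv_file, newline="") as f:
    for row in csv.DictReader(f):
        response = requests.get(args.url, params={"id": row["tweet_id"]})
        print(row["tweet_id"], response.status_code)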

PostgreSQL - this is an ok tutorial. Part VII may be more complicated than you need. I'd love to see a better tutorial; SQLBolt may be that better tutorial.
A bit about PostGIS - PostGIS is a library that lets you use geo data in your PostgreSQL database somewhat sanely. You can probably skip most of this. SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; is probably what you need.
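(And here's roughly what running that query from Python looks like with the psycopg2 module; the host/database/user are placeholders, so ask whoever runs your server for the real ones.)

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="tweets", user="me")
cur = conn.cursor()
cur.execute("""
    SELECT * FROM tweet_pgh
    WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates;
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()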

There's probably a lot more useful stuff I could put here! Let me know if you've got anything I should add. Also, tell me if you have feedback, good or bad, on any of these.