Friday, May 20, 2016

Our House, in the Middle of Our Tweets: A summary in plain English

... I hope! Tell me if this is not actually as plain English as I hope it is. For the tl;dr, just read the headings.

1. We did a pretty good job of finding where people live, if they've posted geotagged tweets.

By "geotagged tweets", we mean "tweets with a lat/lon point." This is rare: about 1% of tweets have one. When you use Twitter, your tweets are not geotagged by default; you have to go in and select "yeah, add my location." (As of a few months ago, you even have to click another button that says "share precise location", so not many people do it.) But some people like to do it, to show that they are somewhere, or to remember it later, or who knows why.

We tried to tell where they live at the neighborhood level. We could find about 79% of users' homes within 1km. (56% within 100m, 88% within 5km).

How do we know we found their homes? We collected 195 people's addresses in Pittsburgh by asking them in an online survey. (We asked the 4119 most frequent geotagged tweeters in Pittsburgh; after filtering out spam etc., we had 195 responses. We paid each respondent with a $5 Amazon gift card.)
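To make the "within 1km" numbers concrete, here is a rough sketch of how such a score could be computed: measure the distance between each guessed home and the address from the survey, then count how many guesses land inside a threshold. The haversine formula, the Earth radius constant, and the function names are my assumptions for illustration; the post doesn't say exactly how the distances were measured.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def fraction_within(estimates, truths, threshold_km):
    """Share of users whose estimated home is within threshold_km of the truth."""
    errors = [haversine_km(e, t) for e, t in zip(estimates, truths)]
    return sum(err <= threshold_km for err in errors) / len(errors)
```

Calling `fraction_within(estimates, truths, 1.0)` on the surveyed users would give the kind of number reported above, and 0.1 or 5.0 would give the 100m and 5km versions.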

2. It's not that hard: remove daytime tweets and social cross-posts, and use grid search.

If you're trying to find someone's home, first take out all the tweets during the day (6am-8pm). Then take out all the social cross-posts from Foursquare, Instagram, and all other social apps. In both of these cases, you lose a little bit of signal and a lot of noise. Like, your daytime tweets are sometimes at home and sometimes away, but your nighttime tweets are way more often at home.

Then use grid search. Bin all the remaining tweets into 1-degree lat/lon squares, pick the square that has the most tweets, and throw out the rest. Then bin those tweets into 0.1-degree squares, pick the square that has the most tweets, and throw out the rest. Do the same with 0.01-degree and 0.001-degree squares. The center of that last square is our guess for where they live.
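For the code-minded, the two steps above can be sketched in a few lines of Python. This is a minimal sketch, not the code from the paper: the tweet format (a dict with "lat", "lon", "local_time", and "source"), the function names, and the short list of cross-posting apps are all made up for illustration, and ties between equally busy squares are broken arbitrarily.

```python
import math
from collections import Counter
from datetime import datetime

# Hypothetical list: the post names Foursquare and Instagram, plus unnamed others.
CROSS_POST_SOURCES = {"foursquare", "instagram"}


def keep_tweet(tweet):
    """Keep only night-time tweets (outside 6am-8pm) that are not cross-posts."""
    is_daytime = 6 <= tweet["local_time"].hour < 20
    is_cross_post = tweet["source"].lower() in CROSS_POST_SOURCES
    return not is_daytime and not is_cross_post


def cell_of(point, cells_per_degree):
    """Index of the grid square containing a (lat, lon) point."""
    return (math.floor(point[0] * cells_per_degree),
            math.floor(point[1] * cells_per_degree))


def grid_search_home(points):
    """Return the center of the busiest 0.001-degree square, or None if no points."""
    remaining = list(points)
    if not remaining:
        return None
    best = None
    # 1, 10, 100, 1000 cells per degree = 1, 0.1, 0.01, 0.001-degree squares.
    for cells_per_degree in (1, 10, 100, 1000):
        counts = Counter(cell_of(p, cells_per_degree) for p in remaining)
        best = counts.most_common(1)[0][0]
        # Throw out everything outside the winning square before zooming in.
        remaining = [p for p in remaining if cell_of(p, cells_per_degree) == best]
    return ((best[0] + 0.5) / 1000, (best[1] + 0.5) / 1000)


# Example with made-up tweets: only the first one survives the filter.
tweets = [
    {"lat": 40.4443, "lon": -79.9432, "source": "Twitter for iPhone",
     "local_time": datetime(2016, 5, 1, 23, 15)},
    {"lat": 40.4406, "lon": -79.9959, "source": "Instagram",
     "local_time": datetime(2016, 5, 1, 22, 0)},
    {"lat": 40.4406, "lon": -79.9959, "source": "Twitter for iPhone",
     "local_time": datetime(2016, 5, 2, 13, 0)},
]
night_points = [(t["lat"], t["lon"]) for t in tweets if keep_tweet(t)]
print(grid_search_home(night_points))
```

Multiplying by the number of cells per degree, instead of dividing by 0.1 or 0.001, is just a way to dodge some floating-point surprises at the edges of squares.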

This might seem like a simple algorithm, and it is! We tried a bunch of more complicated things (see paper for details) and they didn't work as well.

3. However, this turns out to be more useful to learn things about places than about people.

Ok, pretty neat result, but sort of not awesome, for two reasons. First, 79% isn't that great: you can't really build that into a product if it fails one time in five. And there are good reasons we can't get much better; maybe 85%, but probably not higher (see the paper). Second, as I explained above, almost nobody geotags their tweets! What good is a "learning about people" algorithm if it can only learn about 0.01% of the population?

Here's what it might be good for: learning about neighborhoods. If we can figure out where a bunch of people live, then we can put together a set of people who live in your neighborhood, and figure out what they're saying. That's what we're currently thinking.

More: read the paper!
