Monday, February 8, 2016

Welcome to Domo

EDIT: the Domo server is now shut down. If you want access to any of this data, ask me to put you in touch with Shuguan Yang and Sean Qian who are running a server that has this data now. Or, ask me about the S3 bucket that it's all stored in.

I've told this to a lot of people so I've decided to store it all in one place. This guide will range from super-basic to kinda-complicated, so apologies if it's obvious in parts, and apologies if you get lost in parts. ALSO, if you're reading this and you're not new to our group and/or server, then you may have some advice for me, and I'd appreciate it!

Domo is our Amazon Web Services server. It's named after this guy:

On Domo, we have coordinate-geotagged tweets from the following cities (all stored in the PostgreSQL database "tweet"):
Pittsburgh: since Jan 22, 2014. (table tweet_pgh)
SF, NY: since June 13, 2014. (tweet_sf, tweet_ny)
Houston, Cleveland, Seattle, Miami, Detroit, Chicago, London: since November 7, 2014
Minneapolis: since March 18, 2015
San Antonio, Austin, and Dallas: since June 15, 2015
(Everything after SF and NY is stored in tweet_(cityname), where cityname is lowercase, all one word.)
We also have tweets in Pittsburgh beyond just coordinate-geotagged tweets, in table tweet_pgh_all.
We also have Instagrams in Pittsburgh from fall 2014 to May 2016 (when Instagram shut off access to public geotagged Instagrams), in table instagram_pgh.
And some Flickr photos and other misc data sets. (These are not in PostgreSQL; they're in /data/datasets/.)
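
If you're scripting against several cities, the naming rule above can be sketched roughly like this. The function name table_for_city and the SPECIAL_CASES dict are made up for illustration; the only facts taken from this post are the three explicitly named tables and the "lowercase, all one word" rule. It only knows the abbreviations "SF" and "NY" (spelling out "New York" would give the wrong name), and it doesn't check that the table actually exists.

```python
# Sketch: map a city name to its tweet table, following the naming rule
# described in the post. Hypothetical helper, not part of any Domo tooling.

# The three tables the post names explicitly.
SPECIAL_CASES = {
    "pittsburgh": "tweet_pgh",
    "sf": "tweet_sf",
    "ny": "tweet_ny",
}


def table_for_city(city: str) -> str:
    """Return the (assumed) table name for a city."""
    # "lowercase, all one word": drop whitespace and lowercase the rest.
    key = "".join(city.lower().split())
    return SPECIAL_CASES.get(key, "tweet_" + key)


if __name__ == "__main__":
    for city in ["Pittsburgh", "San Antonio", "London"]:
        print(city, "->", table_for_city(city))
```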

We really only interact with Domo via terminal windows, so if that's not your forte, you may have some difficulty. To log in, use "ssh (your username on Domo)@(Domo's hostname)".
If you want to make it easier, you can open ~/.ssh/config and add an entry like the one below; after that, "ssh domo" is all you need to type.

Host domo
Hostname (Domo's hostname)
User (your username on Domo)

We store the tweets in PostgreSQL. If you've used other SQL databases, it's pretty similar, but not the same. Things to know about Postgres and our DB in particular:
  • psql tweet to connect to our database (which is called "tweet").
  • \d to list all relations (aka tables, kinda)
  • \d tablename to get more info about a certain table.
  • The tweets go in basically direct from the Twitter 1% public feed (using this script). They're all stored as text and integers except for some things that are "hstores" - basically key-value sets - and the "coordinates", which are stored using PostGIS as Points.
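  • To access those Points, use some of the PostGIS functions. For example, SELECT ST_AsText(coordinates) FROM tweet_pgh; to get them in a semi-readable format. ST_AsGeoJSON has been the most useful for me.
  • To query all tweets within an area: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; (The arguments are min longitude, min latitude, max longitude, max latitude, then the SRID. The ~ operator is true when the envelope's bounding box contains the point.)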
    • (that "4326" is, for current purposes, a magic number. It means EPSG 4326/WGS-84 which is pretty much a standard for everything I do. So I always just leave it as 4326, and if you don't know better, I suggest you do too.)
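
If you'd rather run that area query from a script than from psql, here's a minimal sketch of how it could look. Big assumptions: I'm using psycopg2-style %s placeholders (this post doesn't say which client library anyone uses), and bbox_query and lon_lat are hypothetical helper names, not anything that exists on Domo. The code below only builds the query and parses the GeoJSON text, so it runs without a database.

```python
import json
import re

SRID = 4326  # EPSG 4326 / WGS-84, same magic number as above


def bbox_query(table, min_lon, min_lat, max_lon, max_lat):
    """Build (sql, params) for the bounding-box query shown above.

    Table names can't be passed as query parameters, so the name is
    checked against a conservative pattern before being interpolated.
    """
    if not re.fullmatch(r"tweet_[a-z_]+", table):
        raise ValueError("unexpected table name: %r" % table)
    sql = (
        "SELECT *, ST_AsGeoJSON(coordinates) AS geojson "
        "FROM " + table + " "
        "WHERE ST_MakeEnvelope(%s, %s, %s, %s, %s) ~ coordinates;"
    )
    return sql, (min_lon, min_lat, max_lon, max_lat, SRID)


def lon_lat(geojson_text):
    """Extract (lon, lat) from ST_AsGeoJSON output for a Point."""
    point = json.loads(geojson_text)
    lon, lat = point["coordinates"][:2]
    return lon, lat
```

With a psycopg2 cursor (again, an assumption), you'd call cur.execute(*bbox_query("tweet_pgh", -79.9, 40.44, -79.899, 40.441)) and then pass each row's geojson column to lon_lat.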
Things to know about Domo:
  • Change your password right away. Do this by typing "passwd" after you SSH in.
  • Don't store things in your homedir! Our whole homedir partition only has about 8 GB. Obviously, that fills up fast. Store anything you can in /data, which has 1 TB. I might bug you sometimes to clean up your homedir if you end up using a lot of space.
When I add you to Domo, I'll tell you:
  • your username on Domo
  • your temporary password (change this as soon as possible)
  • Domo's hostname (not shown here so we get attacked as little as possible)
You should tell me:
  • whether you want a username that's different from your email address. Tell me ASAP and I'll create that and delete your old one.
  • your GitHub username, so I can add you to our GitHub organization.
Dan's note to himself:
  • give the new person an account with sudo adduser username
  • give them a PostgreSQL account (CREATE USER username;) and give them permission to read all the tables (GRANT SELECT ON ALL TABLES IN SCHEMA public TO username;)
  • get their GitHub username and give them access to the CMUChimpsLab organization too.
