Computation Frameworks for the Social Sciences (aka I’m teaching a class)

Posted on January 22, 2012 by josh

This spring I am teaching my first course! It is a pretty small seminar, 8-10 graduate students, but they all seem excited about the material. It is a “Programming for Political Scientists” course that will use Python both to teach people how to write good software and to show them how software is currently being used in the discipline. I hope to spend the first half of the course covering basic software engineering and computer science concepts before moving on to some specific applications. Hopefully, by the end of the course, every student will have built something that they can use to further their research agenda (e.g., a web scraper to supplement a data set).

The class website is up at joshcutler.github.com/PS398/, where you can find my schedule outline as well as the homework assignments. One thing that has worked well so far is requiring that everyone work through Zed Shaw’s Learn Python the Hard Way before class. This got everyone (note that most had not done any serious programming before) on the same page and ready to start tackling some more advanced ideas.

For those of you that have experience in the space, I would welcome any feedback on the syllabus (or anything else). As the course moves along I will try to post about any changes that I make, findings, etc. for those interested in teaching a similar course elsewhere.

Line numbers on embedded Gists

Posted on January 3, 2012 by josh

For all of you bloggers out there who like embedding gists but are frustrated by the lack of line numbers, there is a nice CSS solution. After a little googling, I found this solution by potch. It looks as though the structure of an embedded gist has changed slightly, so I modified the CSS to look like the following:

.gist .highlight {
    border-left: 3ex solid #eee;
    position: relative;
}
 
.gist .highlight pre {
    counter-reset: linenumbers;
}
 
.gist .highlight pre div:before {
    color: #aaa;
    content: counter(linenumbers);
    counter-increment: linenumbers;
    left: -3ex;
    position: absolute;
    text-align: right;
    width: 2.5ex;
}

Include this somewhere in your stylesheets and voilà, you get nice line numbers for all of your embedded gists (in modern browsers at least).

Rook

Posted on December 30, 2011 by josh

As I said in my last post, I am working on setting up a statistical web service using R. I decided to use the Rook library to do so and wanted to give a brief overview of how Rook works for others who might be interested.

The best way to learn is often to just dig right in and go line by line, so here is a simple Rook application:

require('Rook')

library(Rook)
library(rjson)

rook = Rhttpd$new()
rook$add(
  name ="summarize",
  app  = function(env) {
    req = Rook::Request$new(env)
    
    numbers = as.numeric(unlist(strsplit(req$params()$numbers, ",")))
    
    results = list()
    results$mean = mean(numbers)
    results$sd = sd(numbers)

    res = Rook::Response$new()
    res$write(toJSON(results))
    res$finish()
  }
)

rook$browse("summarize")

Line 6 creates a new Rook web server object. This is what will respond to HTTP requests from the browser.

Line 7 adds an “application” to the webserver. When you add an application to a Rook server you need to name a route and specify what should happen when a user requests that action. The route is specified by line 8 and tells the Rook server that when a user requests “summarize” it should execute the code specified in lines 9-21. For some reason Rook prefaces all routes with the word “custom”, so the url to access the route we specified would be server:port/custom/summarize

The interesting part of the application occurs in the function that is assigned to app on line 9. Rook wraps the parameters of the HTTP request in some nice accessors, which can be used as seen in line 10. While the docs specify all of the information available, the important part of the HTTP request for our app is the params() method. This returns the union of any variables passed via query string and POST. For Rails developers, it behaves much like the Rails params hash.

Line 12 parses the input. This application is expecting an array of numbers separated by commas assigned to the numbers parameter. Note that if this isn’t specified the app will break in its current state.

After parsing the input parameters we compute some simple summary statistics on this array of numbers and store them in the list results (14-16).

Rook uses the Rook::Response object to fashion a proper HTTP response. After instantiating the response on line 18, we call the write() method on the response object. This sends whatever string it is passed to the output stream, which is returned to the requestor (in our case the web browser). Note that I am returning the results in JSON format and should probably set HTTP headers as well if I were going to deploy this to production. Line 20 flushes the output stream and returns it to the requestor. Call this when you are done constructing your response.

Finally, to start our server and see the thing in action we can use the browse() method (line 24). We are specifying the action that we want to browse to as the parameter. This should pop open a web browser pointing to your Rook application–and you should get an error:

Error in strsplit(req$params()$numbers, ",") : non-character argument

This is because we didn’t validate the input and our application is expecting an array of numbers. So, let’s pass them through the query string. Append the following to the URL in your browser: “?numbers=1,2,3,4,5” and refresh. Now you should get the following:

{"mean":3,"sd":1.58113883008419}

And you have a functioning Rook server!
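For readers coming from Ruby, the same endpoint can be sketched as a bare Rack-style app that validates its input instead of erroring out. This is only an illustration under my own assumptions: the names `SUMMARIZE_APP`, `json_response`, and `sample_sd` are hypothetical, the query string is parsed with the standard library rather than a Rack request object, and it is not a translation of how Rook works internally.

```ruby
require 'json'
require 'uri'

# Build a Rack-style response triple: [status, headers, body].
def json_response(status, payload)
  [status, { 'content-type' => 'application/json' }, [JSON.generate(payload)]]
end

# Sample standard deviation (n - 1 denominator), the same convention as R's sd().
def sample_sd(values, mean)
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

# A Rack-style app is any object that responds to call(env).
SUMMARIZE_APP = lambda do |env|
  params = URI.decode_www_form(env.fetch('QUERY_STRING', '')).to_h
  raw = params['numbers'].to_s

  # Parse "1,2,3" into floats; treat anything unparseable as invalid input.
  numbers = begin
    raw.split(',').map { |s| Float(s) }
  rescue ArgumentError
    nil
  end

  # Validate up front: a missing, malformed, or single-value list gets a 400.
  if numbers.nil? || numbers.size < 2
    return json_response(400, { 'error' => 'expected ?numbers=1,2,3 with at least two values' })
  end

  mean = numbers.sum / numbers.size
  json_response(200, { 'mean' => mean, 'sd' => sample_sd(numbers, mean) })
end

status, _headers, body = SUMMARIZE_APP.call({ 'QUERY_STRING' => 'numbers=1,2,3,4,5' })
puts status
puts body.first
```

Calling it with an empty query string returns a 400 with a JSON error message rather than a stack trace, which is the kind of check the R version above is missing.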

If you want to add other functionality, you can just make additional add() calls on the rook object and the new applications will be added. If you want to change the functionality of an existing application, make sure that you remove it (rook$remove('summarize')) and then re-add it (or just restart everything).

Rook and R Webservices

Posted on December 19, 2011 by josh

Recently I have been working on setting up a web service that does some nontrivial statistical work. Normally, my go-to for building web services is Ruby/Rails due to ease of use, and I offload anything computationally intensive to something more optimized (e.g. a C or Java application on the same box). In this case, however, partly because of my co-author’s skill, partly because of discipline norms, and largely because R is awesome for this sort of thing, the statistical work is going to be done in R.

While it would have been possible to still build the web service wrapper in Ruby and then either use one of the existing Ruby wrappers for R or spin up an R process on its own, I wanted to see if I could build the whole thing in R. As is almost always the case, I was not the first person to think of this, and most of the hard work has already been done.

Rook is an R package on CRAN written by Jeffrey Horner. For those of you familiar with Ruby and Rack, Rook is very similar. It provides an interface, using R’s built-in Rhttpd web server, to handle HTTP requests, routing, etc. The more I read about it, the more I was convinced that it was a great solution to the problem that I was working on. Now if only I could get it to run on Heroku….

Well, running Rook on Heroku was surprisingly simple thanks to Noah Lorang’s example which you can find here.

So, how do you get started? Almost everything that you’ll need is in the README in Noah’s repository, but there are a couple of tricky things to note.

First make sure that you have a Heroku account. It is free to sign up and you get one free full-time process per project (one single web server in our case). There are numerous resources (including their excellent help files) to get you through this part.

Next you can either walk through the instructions in Noah’s example (which I ended up doing), or you can do the much easier thing by cloning his repository. If you do this, then you should be able to deploy it directly.

After you get a running instance of Rook, you will want to write some R code. To run your own custom R script, replace the “/app/demo.R” in the rackup file (config.ru) with the path of your script. Otherwise, you can just put your code in demo.R.

Because Heroku’s file system is read-only, you will need to include any R packages in your source tree so that they are “installed” when you deploy. Initially you will just have R and Rook installed (if you cloned the existing project). Because some packages require native compilation, you really should do that compilation on one of Heroku’s servers. In order to do this, you need to ssh into your app server:

heroku run bash

Once you are in the bash shell on your app server, you can load R. From there, install any packages that will be dependencies for your project. When you exit the R shell, do not log out of the ssh session. If you do, the app will reset and you will need to start over (remember, it is read-only, so file system changes persist only as long as your session). You then need to figure out how you want to get these changes off of your Heroku instance. First archive them (I archived the whole bin directory):

tar -cvzf mybin.tar.gz ./bin

Then you can either scp it off of the machine (as Noah suggests):

scp mybin.tar.gz me@myserver.com:~/myproject/bin.tar.gz

Or, if you do not have access to a destination that you can scp to (Heroku does not have ftp installed), you can use the roundabout method of setting up GitHub as a remote, checking in the tarball, pushing it to GitHub, and then cloning that repository locally. Once you have the tarball on your machine, just untar it into your repository’s bin directory, check in, and deploy. You will now have access to those packages in R.

I’ll write up how to actually use Rook in my next post.

Naive Bayes with Laplacean Smoothing

Posted on November 13, 2011 by josh

In aiclass.com, we just covered Naive Bayesian Classifiers, and it couldn’t have been more perfectly timed. Prior to that lecture series, one of the projects that I am working on required that I build a classifier for a large body of data that was getting funneled into the system. I spent quite a bit of time searching for the best way to do this, hoping that there would be a rubygem that could save me some effort, but much to my chagrin, nothing quite fit the bill–so I started in on building my own.

The basic idea behind a Naive Bayes classifier is that we have some set of documents that have been categorized (into n categories) and want to use this information about our existing labeled documents to predict the category of new, not yet labeled, documents. It is a pretty direct use of Bayes rule and is probably best understood through an example.

Say you have 5 documents:

{subject: ‘Must read!’, text: ‘Get Viagra cheap!’, label: ‘spam’}
{subject: ‘Gotta see this’, text: ‘Viagra. You can get it at cut rates’, label: ‘spam’}
{subject: ‘Call me tomorrow’, text: ‘We need to talk about scheduling. Call me.’, label: ‘not spam’}
{subject: ‘That was hilarious’, text: ‘Just saw that link you sent me’, label: ‘not spam’}
{subject: ‘dinner at 7’, text: ‘I got us a reservation tomorrow at 7’, label: ‘not spam’}

We have 2 spam messages and 3 real messages. Each of these messages has a subject and some text that we can use to train our classifier.

Given a new message:

{subject: ‘See it to believe it’, text: ‘Best rates you’ll see’, label: ?}

What is the probability that it is a spam message? Using Bayes rule we can compute it in the following way:
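In standard notation, and assuming the usual two-class form where “message” stands for the subject and text of the new document, Bayes rule reads:

```latex
P(\text{spam} \mid \text{message}) =
  \frac{P(\text{message} \mid \text{spam})\, P(\text{spam})}
       {P(\text{message} \mid \text{spam})\, P(\text{spam})
        + P(\text{message} \mid \text{not spam})\, P(\text{not spam})}
```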

All of these values can be computed by inspection of the previous documents:
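For instance, the class priors follow directly from the label counts, and a word probability can be read off by counting occurrences (here I am assuming word occurrences are counted per field, so “see” is 1 of the 5 words in the spam subjects):

```latex
P(\text{spam}) = \frac{2}{5}, \qquad
P(\text{not spam}) = \frac{3}{5}, \qquad
P(\text{subject word} = \text{see} \mid \text{spam}) = \frac{1}{5}
```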

Note that in the case of Naive Bayes you assume independence of your variables (which is probably not true given that the English language is structured).

So for example, in the document we want to classify:
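Under the independence assumption, the likelihood factors into one term per word. Written generically (the per-field split into subject and text words is my reading of the example, not something the post pins down):

```latex
P(\text{message} \mid \text{spam}) =
  \prod_{w \,\in\, \text{subject}} P(w \mid \text{spam})
  \;\prod_{w \,\in\, \text{text}} P(w \mid \text{spam})
```

For the subject ‘See it to believe it’, the first product runs over the words see, it, to, believe, it.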

You will note that the document to classify has some words that are not in any of the existing classified documents (e.g. ‘believe’). This will give those conditional probabilities a value of 0, thus making the numerator 0 even though there is definitely a greater than 0 chance of this item being classified as spam.
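To see the zero problem concretely, here is a minimal Ruby sketch restricted to the subject field of the two spam messages. The helpers `tokenize`, `word_counts`, `unsmoothed`, and `unsmoothed_likelihood` are hypothetical names of my own and are not part of any gem; the tokenizer is a deliberately crude assumption.

```ruby
# Subjects of the two spam messages in the training set above.
SPAM_SUBJECTS = ['Must read!', 'Gotta see this'].freeze

# Split text into lowercase word tokens (hypothetical helper).
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Count how often each word occurs across a list of texts.
def word_counts(texts)
  texts.flat_map { |t| tokenize(t) }
       .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

# Unsmoothed estimate: occurrences of the word / all word occurrences.
def unsmoothed(word, counts)
  counts.fetch(word, 0).to_f / counts.values.sum
end

# Product of the per-word estimates, as the independence assumption implies.
def unsmoothed_likelihood(text, counts)
  tokenize(text).map { |w| unsmoothed(w, counts) }.reduce(1.0, :*)
end

counts = word_counts(SPAM_SUBJECTS)
puts unsmoothed('see', counts)      # seen once among 5 words
puts unsmoothed('believe', counts)  # never seen, so the estimate is 0.0
puts unsmoothed_likelihood('See it to believe it', counts) # product is 0.0
```

A single unseen word is enough to zero out the whole product, no matter how spammy the rest of the subject looks.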

The solution to this problem is known as Laplacean Smoothing. In order to perform smoothing you pick some parameter k. In our case we can set k=1. This smoothing parameter is added to every count as the probabilities are calculated, and a normalizing term (k times the number of possible values) is added to the denominator so that the result is still a valid probability.

Thus, with a smoother of size 1:

Where does the 5*1 come from in the denominator? We have a smoothing factor of 1 and 5 different known values for words in the subject, so to keep the known values a true probability distribution (one that sums to 1) we need to add 5*1 to the denominator.
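The arithmetic can be checked with a short Ruby sketch. It assumes that the 5 refers to the five distinct words in the spam subjects (must, read, gotta, see, this) and that counts are taken over the subject field only; `tokenize`, `word_counts`, and `smoothed` are hypothetical helper names, not part of any gem.

```ruby
# Subjects of the two spam messages in the training set above.
SPAM_SUBJECTS = ['Must read!', 'Gotta see this'].freeze

# Split text into lowercase word tokens (hypothetical helper).
def tokenize(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Count how often each word occurs across a list of texts.
def word_counts(texts)
  texts.flat_map { |t| tokenize(t) }
       .each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

# Smoothed estimate: (count + k) / (total words + k * distinct known words).
def smoothed(word, counts, k = 1)
  (counts.fetch(word, 0) + k).to_f / (counts.values.sum + k * counts.size)
end

counts = word_counts(SPAM_SUBJECTS)
puts smoothed('see', counts)      # (1 + 1) / (5 + 5*1)
puts smoothed('believe', counts)  # (0 + 1) / (5 + 5*1), no longer zero

# The estimates for the five known words add up to 1 (up to float rounding).
puts counts.keys.sum { |w| smoothed(w, counts) }
```

With k=1 an unseen word such as ‘believe’ gets a small nonzero probability, so the product over the words of a new subject no longer collapses to 0.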

Like I said, the math here is pretty straightforward if you can buy the assumptions. And even if you can’t, it seems to work pretty well.

So, how did I end up using it in my app? I built a pretty simple gem to do classification called Classyfier. It was based loosely on Bayes Motel, but I cleaned up and reorganized some things (as well as added smoothing). I anticipate that I will be adding more features to this package as my need for more sophisticated classification grows. For more info on how to use the gem, see the example below or just check out the test file.

require 'classyfier'

@classyfier = Classyfier::NaiveBayes::NaiveBayesClassifier.new
@classyfier.train({:subject => 'Must read!', :text => 'Get Viagra cheap!'}, :spam)
@classyfier.train({:subject => 'Gotta see this', :text => 'Viagra.  You can get it at cut rates'}, :spam)
@classyfier.train({:subject => 'Call me tomorrow', :text => 'We need to talk about scheduling.  Call me.'}, :not_spam)
@classyfier.train({:subject => 'That was hilarious', :text => 'Just saw that link you sent me'}, :not_spam)
@classyfier.train({:subject => 'dinner at 7', :text => 'I got us a reservation tomorrow at 7'}, :not_spam)
        
@scores = @classyfier.classify({:subject => 'See it to believe it', :text => 'Best rates you\'ll see'})

gipoco.com is neither affiliated with the authors of this page or responsible
for its contents. This is a safe-cache copy of the original web site.

whyhat

…can math fix this?
