Friday, November 16, 2012

Machine Learning - predicting outcomes from activity

So I have been meandering through Andrew Ng's (Stanford) excellent Machine Learning class at Coursera, which I highly recommend.  

My progress in the course has been slowed by the fact that I keep running across things I can use right now. I'm in the unique position of having an applicable data set as well as real-world use cases.  Once I found Mahout (scalable machine learning libraries that run on top of Hadoop), I was in.

I'm far less interested in the ins and outs of the actual ML algorithms than in what they can do for me.  So, once I had a basic understanding of classification systems, I started hacking.

In my earlier (and kludgier) efforts at ML, I had to define success and then measure current students against whatever benchmark I set (based on historical data).  That meant lots of manual code, repeated interrogation of the data, and scaling issues.

ML algorithms can help with all of these issues.


• Train a model on a data set.
• Test on a subset (if good continue).
• Use the model with live data to make predictions.
• Leverage those predictions to improve outcomes.


I'm currently leaning on logistic regression, but there are other algorithms in the toolshed depending on your problem set.

The model is built on the logistic function, which maps a weighted sum of the features to a probability between 0 and 1:

h(x) = 1 / (1 + e^(-θᵀx))

Think of my data as a giant matrix (well, at least the numeric fields) that gets plugged into the above equation.  Each line in the matrix is a training example.  The "features" I use are "predictors" in Mahout parlance, and the outcome column is the "target".
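Concretely, each training example is a row of predictor values x, and logistic regression pushes the weighted sum θᵀx through the sigmoid to get a probability. A minimal sketch (the weights here are made up purely for illustration):

```python
import math

def sigmoid(z):
    # squashes any real number into a probability between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

theta = [0.5, 0.01, -0.02, 0.03]  # hypothetical learned weights (incl. intercept)
x = [1.0, 40.0, 30.0, 20.0]       # one training example, leading 1 for the intercept

z = sum(t * xi for t, xi in zip(theta, x))
p = sigmoid(z)  # predicted probability that this example is in class 1
```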

So I took the following snapshot of data from CJ101 classes in 2012, covering all student activity through day 14, and appended the outcomes (grade, success).  We call these our training examples (~4200 used here, ~200K in total for 2012).

username,class,term,section,pct,mins,subs,days,grade,success …
...
TheBeaver,CJ101,1201A,14,0.97,12860.97,235,40,A,1 
WardCleaver,CJ101,1202B,05,0.86,2930.28,90,20,F,0
JuneCleaver,CJ101,1202C,16,0.94,6432.99,124,28,C,1
EddieHaskel,CJ101,1201C,03,1.0,7257.34,96,40,A,1
WallyCleaver,CJ101,1202B,18,0.81,29736.2,462,54,A,1
LumpyRutheford,CJ101,1201A,11,0.78,11685.95,308,48,C,1
WhiteyWitney,CJ101,1201A,03,0.96,37276.78,415,63,A,1
LarryMondello,CJ101,1202C,10,0.73,7223.47,150,22,F,0
MissLanders,CJ101,1204A,04,0.98,13717.4,292,57,B,1 
.. 

Here I set "success" to 1 for good student outcomes (A-C) and 0 for bad ones (D and below).  My goal is to train a model that uses my numeric features (pct, mins, days, subs) to predict the "success" value.
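Deriving that binary target from the letter grade is a one-liner; a sketch, assuming grades arrive as plain letters:

```python
def success(grade):
    # good outcomes (A-C) -> 1, bad outcomes (D and below) -> 0
    return 1 if grade.upper() in ("A", "B", "C") else 0
```

Run over the snapshot above, this turns TheBeaver's A into a 1 and WardCleaver's F into a 0.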

mahout trainlogistic --input CJ101_data.csv --output ./model --target success --categories 2 --predictors pct mins days subs --types numeric

success ~ 0.001*Intercept Term + 0.009*days + -0.001*mins + 0.000*pct + 0.022*subs       
Intercept Term 0.00058                
days 0.00878                
mins -0.00119                 
pct 0.00041                
subs 0.02231
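To get a feel for what those weights mean, you can plug them back into the logistic function by hand. This is only a sketch of the formula using the reported coefficients — Mahout encodes features internally, so don't expect it to reproduce runlogistic's scores exactly:

```python
import math

# coefficients reported by trainlogistic above
w = {"Intercept": 0.00058, "pct": 0.00041, "mins": -0.00119,
     "days": 0.00878, "subs": 0.02231}

def score(pct, mins, days, subs):
    # weighted sum of predictors, squashed to a probability of success=1
    z = (w["Intercept"] + w["pct"] * pct + w["mins"] * mins
         + w["days"] * days + w["subs"] * subs)
    return 1.0 / (1.0 + math.exp(-z))
```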

So Mahout picks the weights for my features, which are used in the logistic regression, basically telling me how important each of these terms is.  Looking at this particular class ... days and subs are much more important than mins or pct.

Now I need to test how good the model is at predicting the success rate based on pct, mins, days, and subs.

mahout runlogistic --input CJ101_data.csv --model model --auc --confusion

AUC = 0.74
confusion: [[1594.0, 163.0], [16.0, 2.0]]
entropy: [[-0.0, -0.0], [-14.9, -2.6]]
12/11/14 15:32:14 INFO driver.MahoutDriver: Program took 844 ms 
* AUC (area under the ROC curve) measures how well the model ranks successes above failures; 1.0 is perfect, 0.5 is no better than chance
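AUC has a handy interpretation: it's the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. A sketch with toy scores (not the real model's output):

```python
# toy predicted scores and their true labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]

# count positive/negative pairs the model ranks correctly (ties count half)
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))  # 8 of 9 pairs ranked correctly here
```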

I can actually fiddle around and get this much higher (while making the predictions more usable as well), but let's save that for another post.  The long and short of it: just 14 days into a 70-day class, I can predict the outcomes of CJ101 students with an AUC of 0.74.

This information can not only identify struggling students very early in a class, but also be leveraged in automated interventions.

Not too shabby at all!