An interactive discussion of Real Time Intelligence at Kaplan.
Friday, May 3, 2013
KapX Technology Expo - 04-19-2013 - Ft. Lauderdale
Wednesday, April 3, 2013
Act before you think?
If someone asks me what Big Data buys you, I almost always reply with "speed".
Big Data solutions allow us to act rather than think. As an innovation guy and a follower of the Lean Startup model, I am all about speed. Fail fast, fail forward, learn as you go, etc. This is the polar opposite of most BI and business processes. I'm not interested in models that explain what happened after the fact, but in tools that allow me to shape outcomes.
I was recently approached by a business partner looking to solicit student feedback on our call centers.
As chance would have it, we were in the final stages of getting our call center data (Genesys) into our data store (S3) and Hadoop cluster (EMR). As is normally the case, we weren't really sure what we'd do with it ... just that it would be interesting to have.
Anyhow, after a bit of fiddling we were able to tie the call logs (collected every 15 minutes) to our student records, and we had an avenue for polling our students. Our business partners were thrilled. They had our call center team working on popping voice callbacks, but that project was flailing. What's more, we were skeptical that we'd get a good response rate by 1) asking for callback permission and then 2) calling students back with a phone survey.
So we got going, literally just putting together a daily CSV that we sent to someone who manually emailed the links out. Next we automated the push of the "survey list" to our outbound mail provider. Feedback came back in droves. On top of email, we started pushing the survey links directly into students' activity streams (on our Social Portal); it turns out those students were over 2x as likely to complete surveys. Finally, we started surveying students we had called (outbound calls): "did we call at a good time?", and so on.
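That first daily CSV really was bare bones; the whole idea fits in a few lines of shell and Hive (a sketch only; every table, bucket, and column name here is made up, and our actual job differs):
DAY=$(date -d yesterday +%Y-%m-%d)
# join yesterday's call logs to student records in Hive and spit out a survey list
hive -e "
  SELECT s.student_id, s.email, c.call_time, c.queue
  FROM call_logs c
  JOIN students s ON (c.phone = s.phone)
  WHERE c.call_date = '${DAY}'
" | sed 's/\t/,/g' > survey_list_${DAY}.csv
# drop the list where the outbound mail job can pick it up
hadoop fs -put survey_list_${DAY}.csv s3://survey-drops/${DAY}/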
The voice callback survey project I mentioned earlier is still not live; frankly, it is already irrelevant.
Please note, we did absolutely no analysis before we started sending out surveys. We made no predictions and had no expectations. At the last minute, almost on a lark, our business partner suggested a free-form text feedback block. So much good feedback started coming back that we had to enlist a "fixer" to deal with student issues.
We are literally "saving" students who were ready to throw in the towel, call the BBB, etc.
All because we didn't wait. Having put the data together in an operational format, we were able to get out the door in a few days and then iterate and improve. Next we are going to pull in the survey results and use them to build a real-time success dashboard: which departments, teams, and advisors are doing well?
We may need to revisit the old adage, "think before you act".
Thursday, February 7, 2013
amassing data - beg, borrow, steal
In an ideal world, all the data you need would be readily accessible.
Unfortunately, we don't live in that world. Data is often hidden from view, buried, and hoarded. Sometimes on purpose, and sometimes for perfectly valid reasons.
The role of a Big Data technologist is more American Picker than archeologist. Be honest with yourself: you probably don't know exactly what you're looking for, nor do you have time to figure it out. You'll know something cool when you see it. Make some reasonable assumptions about what you'd like to have and get after it.
You can store 1 TB per year for around $600 on Amazon's S3, and roughly 10% of that if you put it in their long-term storage (Glacier). Consider storage costs damn near free; don't be afraid to start piling stuff up.
The following is my guide to getting data en masse.
Beg (good)
- Ask for it.
- Make this as painless as possible on all parties (go direct).
- Be thankful and courteous, and give props where they are due.
Borrow (better)
- Repurpose others' data.
- Don't wait for any changes, additions, formatting, etc. DIY.
- Pulling data now is much better than waiting for someone to deliver it. Get going.
Steal (best)
- Find out where data lives and go fishing. Poke around.
- Better to plead forgiveness than ask permission.
- If you have access, you probably aren't breaking any rules.
Wash, rinse, repeat.
At any given time you'll probably be doing all of these simultaneously. My only caveat is "don't store junk": I remove non-ASCII characters and validate that files are complete, but I don't do much more than that.
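That clean-up pass can be as simple as something like this (a sketch; the file names are made up):
# strip non-ASCII bytes, keeping tabs, newlines, and printable characters
tr -cd '\11\12\15\40-\176' < incoming.csv > clean.csv
# quick completeness check: make sure the row count survived the scrub
echo "rows in: $(wc -l < incoming.csv), rows out: $(wc -l < clean.csv)"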
Reserve the right to rethink any decisions on what to keep by not deleting anything.
Once you get used to working like this, you will be amazed at how much you can acquire relatively painlessly. Some things will take hours, and some unfortunately years. Some things you'll need ASAP, and some you'll have no idea what to do with (I started collecting some activity data 7 months ago that I never used until yesterday). Who knows?
Take what you can get now, and iterate. Good things will follow.
Monday, January 7, 2013
Why I switched to MapR and love EMR ...
A few months back I was approached by a colleague interested in using Hadoop/Hive to power our Tableau clients. We played around for a bit and learned, the hard way, that the only way we were going to get this to go easily was switching to a MapR distribution.
Fortunately, EMR (Elastic MapReduce) makes this a snap.
./elastic-mapreduce --create --alive --hive-interactive --name "Hive Dev Flow" --instance-type m1.large --num-instances 8 --hadoop-version 1.0.3 --hive-versions 0.8.1.6 --ami-version 2.3.0 --with-supported-products mapr-m3
Literally, you just need to add the parameter (--with-supported-products mapr-m3).
Next you need to open up port 8453 (and possibly some others) on your Hadoop master at EC2 (in Security Groups), and then run the MapR ODBC connector tool. Now you are ready to rock (with any ODBC Hadoop client). One noted caveat is that you can't run Ganglia for stats, but as I noted in the EMR forum you can use the MapR Control Center to the same effect.
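Opening the port is a one-liner with the EC2 API tools; a sketch, assuming the default EMR master security group name and substituting your own source address for the example CIDR:
# allow TCP 8453 into the EMR master group from one workstation only
ec2-authorize ElasticMapReduce-master -P tcp -p 8453 -s 203.0.113.10/32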
Now I have no idea if I will stick with MapR, but the ease of switching between Hadoop distributions, AMIs, versions, etc. with EMR is pretty slick. It did take some trial and error to figure out a working Hadoop/Hive/AMI version combo ... but nothing too crazy.
I know things like Apache WHIRR are making strides, but for now I am very happy with EMR.
Peace,
Tom
Thursday, December 20, 2012
using MapReduce to move files
One of the biggest issues in dealing with large volumes of data is moving it around. Thankfully, we use Amazon's S3 as our data store and process via their EMR, and access inside their network is pretty snappy.
Still, I am moving a lot of data into our Hadoop clusters every morning. This data needs to go through our master and be distributed out to all of the slave nodes (normally 7). The following took a hair over 20 minutes.
hadoop fs -cp s3://eCollegeGradeData/activity/*.CSV /temp4/.
12/12/20 17:12:50 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_20120127.CSV' for reading
12/12/20 17:12:51 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_20120128.CSV' for reading
12/12/20 17:12:51 INFO s3native.NativeS3FileSystem: Opening
...
12/12/20 17:32:22 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_Historical_Extract_20120126_7.CSV' for reading
S3DistCp is an S3-enabled version of Apache Hadoop's DistCp that uses MapReduce (well, Map at least) to move the files directly from S3 to the Hadoop slave nodes, eliminating the master node bottleneck.
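For reference, S3DistCp itself ships as a jar on the EMR AMIs and would be invoked roughly like this (a sketch; the jar path and version vary by AMI, and on a MapR cluster the destination scheme may need to be maprfs:// instead of hdfs://):
# copy the activity files from S3 straight onto the cluster nodes in parallel
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://eCollegeGradeData/activity/ --dest hdfs:///temp4/
The run below used plain distcp against the s3n:// filesystem, which kicks off the same kind of map-only parallel copy on this cluster.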
hadoop distcp s3n://eCollegeGradeData/activity/ /temp4/
12/12/20 17:41:49 INFO tools.DistCp: srcPaths=[s3n://eCollegeGradeData/activity]
12/12/20 17:41:49 INFO tools.DistCp: destPath=/temp4
12/12/20 17:41:50 INFO metrics.MetricsSaver: MetricsSaver DistCp root:hdfs:///mnt/var/lib/hadoop/metrics/ period:60 instanceId:i-b0bbf4ce jobflow:j-3IZ0J8F1W334P
12/12/20 17:41:50 INFO metrics.MetricsUtil: supported product mapr-m3
12/12/20 17:41:50 INFO metrics.MetricsSaver: Disable MetricsSaver due to MapR cluster
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO tools.DistCp: sourcePathsCount=337
12/12/20 17:41:51 INFO tools.DistCp: filesToCopyCount=336
12/12/20 17:41:51 INFO tools.DistCp: bytesToCopyCount=13.0g
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO mapred.JobClient: Running job: job_201212191754_0027
12/12/20 17:41:52 INFO mapred.JobClient: map 0% reduce 0%
12/12/20 17:42:06 INFO mapred.JobClient: map 2% reduce 0%
12/12/20 17:42:07 INFO mapred.JobClient: map 5% reduce 0%
12/12/20 17:42:08 INFO mapred.JobClient: map 8% reduce 0%
12/12/20 17:42:09 INFO mapred.JobClient: map 9% reduce 0%
12/12/20 17:42:10 INFO mapred.JobClient: map 14% reduce 0%
12/12/20 17:42:11 INFO mapred.JobClient: map 16% reduce 0%
12/12/20 17:42:12 INFO mapred.JobClient: map 29% reduce 0%
12/12/20 17:42:13 INFO mapred.JobClient: map 31% reduce 0%
12/12/20 17:42:14 INFO mapred.JobClient: map 39% reduce 0%
12/12/20 17:42:15 INFO mapred.JobClient: map 41% reduce 0%
12/12/20 17:42:16 INFO mapred.JobClient: map 43% reduce 0%
12/12/20 17:42:17 INFO mapred.JobClient: map 45% reduce 0%
12/12/20 17:42:18 INFO mapred.JobClient: map 47% reduce 0%
12/12/20 17:42:19 INFO mapred.JobClient: map 50% reduce 0%
12/12/20 17:42:20 INFO mapred.JobClient: map 51% reduce 0%
12/12/20 17:42:21 INFO mapred.JobClient: map 53% reduce 0%
12/12/20 17:42:22 INFO mapred.JobClient: map 54% reduce 0%
12/12/20 17:42:23 INFO mapred.JobClient: map 56% reduce 0%
12/12/20 17:42:24 INFO mapred.JobClient: map 57% reduce 0%
12/12/20 17:42:25 INFO mapred.JobClient: map 58% reduce 0%
12/12/20 17:42:27 INFO mapred.JobClient: map 60% reduce 0%
12/12/20 17:42:30 INFO mapred.JobClient: map 65% reduce 0%
12/12/20 17:42:32 INFO mapred.JobClient: map 66% reduce 0%
12/12/20 17:42:33 INFO mapred.JobClient: map 69% reduce 0%
12/12/20 17:42:35 INFO mapred.JobClient: map 70% reduce 0%
12/12/20 17:42:36 INFO mapred.JobClient: map 71% reduce 0%
12/12/20 17:42:37 INFO mapred.JobClient: map 73% reduce 0%
12/12/20 17:42:38 INFO mapred.JobClient: map 74% reduce 0%
12/12/20 17:42:39 INFO mapred.JobClient: map 75% reduce 0%
12/12/20 17:42:40 INFO mapred.JobClient: map 76% reduce 0%
12/12/20 17:42:42 INFO mapred.JobClient: map 77% reduce 0%
12/12/20 17:42:44 INFO mapred.JobClient: map 78% reduce 0%
12/12/20 17:42:58 INFO mapred.JobClient: map 79% reduce 0%
12/12/20 17:43:02 INFO mapred.JobClient: map 83% reduce 0%
12/12/20 17:43:03 INFO mapred.JobClient: map 86% reduce 0%
12/12/20 17:43:04 INFO mapred.JobClient: map 87% reduce 0%
12/12/20 17:43:05 INFO mapred.JobClient: map 88% reduce 0%
12/12/20 17:43:07 INFO mapred.JobClient: map 90% reduce 0%
12/12/20 17:43:09 INFO mapred.JobClient: map 92% reduce 0%
12/12/20 17:43:11 INFO mapred.JobClient: map 93% reduce 0%
12/12/20 17:43:12 INFO mapred.JobClient: map 94% reduce 0%
12/12/20 17:43:14 INFO mapred.JobClient: map 97% reduce 0%
12/12/20 17:43:39 INFO mapred.JobClient: map 98% reduce 0%
12/12/20 17:43:45 INFO mapred.JobClient: map 99% reduce 0%
12/12/20 17:43:51 INFO mapred.JobClient: map 100% reduce 0%
12/12/20 17:46:43 INFO mapred.JobClient: Job complete: job_201212191754_0027
12/12/20 17:46:43 INFO mapred.JobClient: Counters: 21
12/12/20 17:46:43 INFO mapred.JobClient: Job Counters
12/12/20 17:46:43 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=1924883
12/12/20 17:46:43 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/20 17:46:43 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/12/20 17:46:43 INFO mapred.JobClient: Launched map tasks=34
12/12/20 17:46:43 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=0
12/12/20 17:46:43 INFO mapred.JobClient: FileSystemCounters
12/12/20 17:46:43 INFO mapred.JobClient: MAPRFS_BYTES_READ=84770
12/12/20 17:46:43 INFO mapred.JobClient: S3N_BYTES_READ=13971069631
12/12/20 17:46:43 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=13971069631
12/12/20 17:46:43 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1659360
12/12/20 17:46:43 INFO mapred.JobClient: distcp
12/12/20 17:46:43 INFO mapred.JobClient: Files copied=336
12/12/20 17:46:43 INFO mapred.JobClient: Bytes copied=13971069631
12/12/20 17:46:43 INFO mapred.JobClient: Bytes expected=13971069631
12/12/20 17:46:43 INFO mapred.JobClient: Map-Reduce Framework
12/12/20 17:46:43 INFO mapred.JobClient: Map input records=336
12/12/20 17:46:43 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=4710170624
12/12/20 17:46:43 INFO mapred.JobClient: Spilled Records=0
12/12/20 17:46:43 INFO mapred.JobClient: CPU_MILLISECONDS=530380
12/12/20 17:46:43 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=56869019648
12/12/20 17:46:43 INFO mapred.JobClient: Map input bytes=50426
12/12/20 17:46:43 INFO mapred.JobClient: Map output records=0
12/12/20 17:46:43 INFO mapred.JobClient: SPLIT_RAW_BYTES=5100
12/12/20 17:46:43 INFO mapred.JobClient: GC time elapsed (ms)=3356
12/12/20 17:46:43 INFO metrics.MetricsSaver: Inside MetricsSaver Shutdown Hook
Now, I should have gone looking for a better solution months ago ... but frankly this wasn't that huge of an issue. I went from 20 minutes to 5 on this particular copy and have seen it go even lower than that. If you add it all up, I was able to reduce my execution time by half an hour.
Not too shabby!
Friday, November 16, 2012
Machine Learning - predicting outcomes from activity
So I have been meandering through Andrew Ng's (Stanford) excellent Machine Learning class at Coursera, which I highly recommend.
My progress in the course has been slowed by the fact that I keep running across things I can use right now. I'm in the unique position of having an applicable data set as well as real-world use cases. Once I found Mahout (scalable machine learning libraries that run on top of Hadoop) I was in.
I'm far less interested in the ins and outs of the actual ML algorithms than in what they can do for me. So, once I had a basic understanding of classification systems, I started hacking.
In my earlier (and kludgier) efforts at ML I had to define success, and then measure current students against whatever benchmark I set (based on historical data). That meant lots of manual code, repeated interrogation of the data, and scaling issues.
ML algorithms can help with all of these issues.
• Train a model on a data set.
• Test on a subset (if good continue).
• Use the model with live data to make predictions.
• Leverage those predictions to improve outcomes.
I'm currently leaning on logistic regression, but there are other algorithms in the toolshed depending on your problem set.
Think of my data as a giant matrix (well, at least the numeric fields) that gets plugged into the logistic function. Each line in the matrix is a training example. The "features" I use are "predictors" in Mahout parlance, and the classifier is the "target".
So I took the following snapshot of data from CJ101 classes in 2012, covering all student activity through day 14, and appended the outcomes (grade, success). We call these our training examples (~4,200 used here, ~200K in total for 2012).
username,class,term,section,pct,mins,subs,days,grade,success …
...
TheBeaver,CJ101,1201A,14,0.97,12860.97,235,40,A,1
WardCleaver,CJ101,1202B,05,0.86,2930.28,90,20,F,0
JuneCleaver,CJ101,1202C,16,0.94,6432.99,124,28,C,1
EddieHaskel,CJ101,1201C,03,1.0,7257.34,96,40,A,1
WallyCleaver,CJ101,1202B,18,0.81,29736.2,462,54,A,1
LumpyRutheford,CJ101,1201A,11,0.78,11685.95,308,48,C,1
WhiteyWitney,CJ101,1201A,03,0.96,37276.78,415,63,A,1
LarryMondello,CJ101,1202C,10,0.73,7223.47,150,22,F,0
MissLanders,CJ101,1204A,04,0.98,13717.4,292,57,B,1
..
Here I set "success" to 1 for good student outcomes (A-C) and 0 for bad ones (D and below). My goal is to train a model that uses my numeric features (pct, mins, days, subs) to predict the "success" value.
mahout trainlogistic --input CJ101_data.csv --output model --target success --categories 2 --predictors pct mins days subs --types numeric
success ~ 0.001*Intercept Term + 0.009*days + -0.001*mins + 0.000*pct + 0.022*subs
Intercept Term 0.00058
days 0.00878
mins -0.00119
pct 0.00041
subs 0.02231
So Mahout picks the weights for my features, which are used in the logistic regression, basically telling me how important each one of these terms is. Looking at this particular class ... days and subs are much more important than mins or pct.
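To make that concrete, here is how those weights turn a single row into a prediction: take the weighted sum and push it through the sigmoid. (Mahout encodes features internally, so this won't reproduce its exact scores; it's just to show the mechanics.)
# score WardCleaver's row (pct mins subs days) by hand with the weights above
echo "0.86 2930.28 90 20" | awk '{z = 0.00058 + 0.00041*$1 - 0.00119*$2 + 0.02231*$3 + 0.00878*$4; print 1/(1 + exp(-z))}'
# prints roughly 0.21, i.e. a low predicted chance of success (he did fail CJ101)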
Now I need to test how good I was at predicting the success rate based on time, subs, days, pct.
mahout runlogistic --input CJ101_data.csv --model model --auc --confusion
AUC = 0.74
confusion: [[1594.0, 163.0], [16.0, 2.0]]
entropy: [[-0.0, -0.0], [-14.9, -2.6]]
12/11/14 15:32:14 INFO driver.MahoutDriver: Program took 844 ms
* AUC (area under the ROC curve) measures how well the model separates successful from unsuccessful students: 0.5 is no better than a coin flip, 1.0 is perfect.
I can actually fiddle around and get this much higher (while making the predictions more usable as well), but let's save that for another post. The long and short of it is that I can predict the outcomes of CJ101 students just 14 days into a 70-day class with 74% accuracy.
This information can not only identify struggling students very early into a class, but be leveraged in automated interventions.
Not too shabby at all!