Friday, July 5, 2013

variety over volume

When embarking on a Big Data effort, it almost always pays to go for variety first.  While volume and velocity may be sexy, variety is what often makes data sets usable.


Now you could do all sorts of machine learning and math to prove whether you should go wide (adding variety) or deep (adding volume), but I'd like to offer a much more practical example.

I've written before about our "Voice of the Customer" pilot.  We basically survey every student who calls in (or whom we call).  We've had a tremendous response, but selling the data internally has been a bit of a slog, because we were tying it all together manually in a cumbersome and irregular process.

So I recently took the time to add the survey results to our Hadoop Cluster.

Somehow I convinced Wufoo (where our surveys live) to give me an integration key (a giant PITA), then used the excellent WuParty ruby gem to download the survey results, which I drop into a tab-delimited file and ship to S3 (where my EMR cluster picks them up for processing).
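The guts of that step fit in a short script.  Here's a minimal sketch, assuming the wuparty and aws-sdk gems; the account name, API key, form id, field ids, and bucket/key are placeholders, not our real ones:

  require 'wuparty'
  require 'aws-sdk'

  # Placeholders -- swap in your own Wufoo subdomain, integration key,
  # and form id.
  wufoo = WuParty.new('my-account', 'MY-WUFOO-API-KEY')
  form  = wufoo.form('voice-of-the-customer')

  # Write one tab-delimited row per survey response.
  # 'Field1'..'Field3' stand in for the real Wufoo field ids.
  File.open('feedback.tsv', 'w') do |out|
    form.entries.each do |entry|
      row = [entry['EntryId'], entry['DateCreated'],
             entry['Field1'], entry['Field2'], entry['Field3']]
      out.puts row.join("\t")
    end
  end

  # Ship the file to S3, where the EMR cluster picks it up.
  s3  = AWS::S3.new  # credentials come from AWS.config or the environment
  obj = s3.buckets['my-survey-bucket'].objects['surveys/feedback.tsv']
  obj.write(file: 'feedback.tsv')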

This is like 100K of data a day.  Certainly not big.
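On the cluster side, "adding" the data is just pointing an external Hive table at those tab-delimited files in S3.  A minimal sketch, with a placeholder bucket path and a schema reconstructed from the query below, not our exact DDL:

  CREATE EXTERNAL TABLE IF NOT EXISTS feedback (
    username       STRING,
    email          STRING,
    interaction_id STRING,
    department     STRING,
    satisfied      INT,
    ease           INT,
    recommend      INT,
    comments       STRING,
    created_date   STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://my-survey-bucket/surveys/';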

But with this added data source (a.k.a. additional variety) I am able to tie the actual calls to the survey results.  I throw the joined records into a delimited file, which our BI team picks up from S3 and uses to build reports for the call center managers and departments.


This is damn near real-time feedback on our call centers, and it is accessible.  The whole join is one Hive query:
hive -e '
  SELECT fb.username, fb.email, fb.interaction_id, fb.department,
         call.department, call.resource_name,
         fb.satisfied, fb.ease, fb.recommend, fb.comments, fb.created_date,
         call.l1_name, call.l2_name, call.l3_name, call.l4_name
  FROM feedback fb
  JOIN calls call ON (fb.interaction_id = call.interaction_id)
  WHERE call.resource_name RLIKE "^[A-Za-z]"
    AND call.l1_name IS NOT NULL AND call.l1_name <> ""
  GROUP BY fb.username, fb.email, fb.interaction_id, fb.department,
           call.resource_name, call.department,
           fb.satisfied, fb.ease, fb.recommend, fb.comments, fb.created_date,
           call.l1_name, call.l2_name, call.l3_name, call.l4_name;
' > agent_feedback.csv
So what I've done here is add almost no volume to our cluster ... but deliver a ton of value through variety.

And the survey results are not a one-trick pony.  Now that I have this data coming into my cluster, I can leverage it in new and previously unthought-of ways.  How about looking at student outcomes alongside CSAT scores?  Can falling CSAT predict bad outcomes?  How about the opposite?

Good questions start to arise only when we have the data to ask them.