Friday, June 22, 2012

fear and loathing in the Hadoop Cluster

What I've noticed most on the business end of Big Data is the love/hate dichotomy.

I'm not talking about DW or BI folks who underestimate the promise of never having to throw data away, using commodity hardware, realizing the scaling efficiencies of the cloud, etc.  It will take time, but they are at least coming around.

I'm talking about folks who are actually scared of data.

Now, I'm a pretty open guy.  I actually like getting proven wrong, which happens an astonishing amount of the time.  It won't stop me from having an opinion, arguing for what I think is right, etc.  It is just the Socratic worldview that I employ.  My goal is to get better, learn more ... not be right.

As I've amassed a good amount of our working business data, I talk it up.  A lot.  To anyone who will listen.  What was shocking from my end were the reactions I often got:

  1. "the data is wrong"
  2. "our data is better" (even if only a small subset ), cousin of #1
  3. "what you are looking at doesn't matter ... what really matters is X" 
  4. "we can't possibly show that to end users without causing a thermonuclear meltdown"
  5. <heads firmly implanted in sand>
People are pretty tied to their preconceived notions, and while I've had some very receptive and faithful allies across our business units, there are a lot more people who'd rather I go away.  If my data doesn't support their line of thinking, they'll do what they can to squash or ignore it.

I doubt I'm alone here.  

My guess is that while businesses want to be results driven, that sounds better in a PowerPoint deck than it plays out in reality.  They need to be a bit more open to what they might find, even if it's not good.  I once got an "A" on a 20-page math paper in college for proving that I was an idiot.

The future will not be won by the blissfully ignorant, but by those willing to learn their way forward.

Businesses successfully leveraging Big Data will follow an iterative, lean, and inquisitive mindset.  Twitter started out as a podcasting company, Flickr as a game, Nokia made paper and galoshes, and IBM actually made stuff.  Call it a pivot, iteration, rehash ... whatever you like.

I call it winning.  

Big Data can help businesses know when and how to pivot, giving insight into black holes and otherwise unverifiable ideas.  It won't solve all our problems, but let's not dismiss what makes us uncomfortable.




Monday, June 11, 2012

digging for gold

I'm pretty new to the world of textual analysis, but the problem is typical of what I've seen in Big Data.

You typically break the process into three steps:
  1. getting and cleaning the data
  2. dicing it up and putting it back together in interesting ways
  3. presenting it
While #2, the big data bit, is the sexiest and most fun ... you end up spending most of your time on #1 and #3 (with the added bonus of eventually having to automate the process).  Data is hard to get and ugly, and figuring out how to present it to end users is a bear.  You need to get data moving through the system and out to presentation.
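
To put a shape on that, here's a toy sketch of the kind of driver you eventually hand off to cron, in Ruby since that's my glue language of choice.  Every file name here is made up for illustration; the point is just that each stage feeds the next, so any step can be rerun on its own:

  #!/usr/bin/env ruby
  # toy pipeline: 1) get & clean, 2) dice it up, 3) present
  # (stage script names are illustrative, not real files)
  STAGES = [
    'fetch_and_clean.rb',  # step 1: beg, download, scrub
    'aggregate.rb',        # step 2: the fun big-data bit
    'publish_report.rb',   # step 3: push something end users can read
  ]
  STAGES.each do |stage|
    system('ruby', stage) or abort "#{stage} failed, stopping the pipeline"
  end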

Recently I uncovered a treasure trove of discussions, literally all the discussions for about 50K university students.  Pretty much all of their day-to-day coursework.  Cool, eh?

Well, I first had to beg the vendor to start giving me the data.  Then when I got it, it was full of all sorts of garbage from the DW reporting tool (MS characters, poorly formed course names, etc.).  So I rehashed a bunch of Ruby SFTP and bizarre shell scripts (sed, awk, tr, expect, etc.) to get the data into a usable form.
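
The scrubbing itself was nothing fancy.  A rough sketch of the sort of thing I mean, using the net-sftp gem; the host, paths, and exact character fixes are placeholders, not my real scripts:

  require 'net/sftp'

  # pull the nightly export down (credentials and paths are placeholders)
  Net::SFTP.start('vendor.example.com', 'report_user', password: ENV['SFTP_PASS']) do |sftp|
    sftp.download!('/exports/discussions.txt', 'discussions_raw.txt')
  end

  # scrub the DW tool's garbage: Windows-1252 "smart" characters, stray
  # carriage returns, and whitespace-padded fields
  File.open('discussions_clean.txt', 'w') do |out|
    File.foreach('discussions_raw.txt', encoding: 'Windows-1252:UTF-8') do |line|
      line = line.tr("\u2018\u2019", "''").tr("\u201C\u201D", '""')  # curly quotes -> plain
      line = line.gsub("\u2013", '-').gsub("\r", '')                 # dashes and CRs
      out << line.split("\t").map(&:strip).join("\t") << "\n"        # tidy each field
    end
  end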

Then I beg the vendor for key mappings so I can get from this raw data back to mine (student and course IDs in this case ... in Hive).  We decide to append this info to a current report (after we duplicate it).  All is well.  But, no.  Turns out that isn't going to map all the users, specifically ones who haven't logged on for five minutes or more.  The vendor goes dark for a while despite my pleading.  Some more begging turns up a solution, in a report we are already running (for something else).  I retool all my scripts to use the other report, and nothing maps.  I bang my head against the desk repeatedly while checking each step along the way.  Cleaning, parsing, reformatting, pushing, SQL?  Still nothing.  Finally I cave and go back to begging the vendor.
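
For what it's worth, the mapping step itself is mechanically trivial once you have a report whose keys actually line up; all the pain was in getting that report.  A sketch of the join, with a made-up layout (vendor id, tab, our student id in the mapping report; vendor id in the first column of the discussion data):

  # load vendor-id -> internal-student-id pairs from the mapping report
  mapping = {}
  File.foreach('key_mapping.txt') do |line|
    vendor_id, student_id = line.chomp.split("\t")
    mapping[vendor_id] = student_id
  end

  # annotate each discussion row with our internal id, counting the rows
  # that don't map (exactly the failure mode I kept hitting)
  unmapped = 0
  File.open('discussions_mapped.txt', 'w') do |out|
    File.foreach('discussions_clean.txt') do |line|
      fields = line.chomp.split("\t")
      if (sid = mapping[fields[0]])
        out.puts([sid, *fields].join("\t"))
      else
        unmapped += 1
      end
    end
  end
  warn "#{unmapped} rows didn't map"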

We set up a meeting, etc.  They think I'm a moron (which at this point might well be true).

Meanwhile I am still hacking the data.  I document a pretty good example and send it on.  Oh wait, the vendor forgot to add a step when he automated the report.  It worked for 05/01 but no other dates.  They rerun the report for the last few days, and it maps.  Great.  I get going again and notice all sorts of duplicates, the same entries showing up on different days' reports.  I ask if this is expected and get a bizarre yes as a reply.

Thankfully I hadn't stopped hacking the data into shape while waiting for a reply.  I carried on knowing the data was wrong (duplicated) and that I could fix that specific problem later (by removing the dups myself).
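
The dedup itself is the easy part.  A sketch, assuming the report date is the last column and everything else identifies the row (the real layout differed, but the idea holds):

  require 'set'

  # drop rows we've already seen across the daily reports, keying on the
  # whole row minus the trailing report-date column (layout is illustrative)
  seen = Set.new
  File.open('discussions_deduped.txt', 'w') do |out|
    File.foreach('discussions_mapped.txt') do |line|
      fields = line.chomp.split("\t")
      out.puts(fields.join("\t")) if seen.add?(fields[0..-2].join("\t"))
    end
  end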

Don't wait; you probably don't know how to interrogate the data yet anyhow.  Keep pushing.

The key is to get through the up-front pain and suffering of getting data flowing through the system.  Once the data is flowing, I can continually dig for gold.  That is the beauty of NoSQL: we don't necessarily have to know what we are looking for when we start collecting data.

The real findings need to be teased out, over time.  Get the data in there, start piling it high, and THEN dig for gold.