Friday, June 22, 2012

fear and loathing in the Hadoop Cluster

What I've noticed most on the business end of Big Data is the love/hate dichotomy.

I'm not talking about DW or BI folks who underestimate the promise of never having to throw data away, using commodity hardware, realizing the scaling efficiencies of the cloud, etc.  It will take time, but they are at least coming around.

I'm talking about folks who are actually scared of data.

Now, I'm a pretty open guy.  I actually like getting proven wrong, which happens an astonishing amount of the time.  It won't stop me from having an opinion, arguing for what I think is right, etc.  It is just the Socratic worldview that I employ.  My goal is to get better, learn more ... not be right.

As I've amassed a good amount of our working business data, I talk it up.  A lot.  To anyone who will listen.  What has been shocking from my end are the reactions I often get:

  1. "the data is wrong"
  2. "our data is better" (even if only a small subset), cousin of #1
  3. "what you are looking at doesn't matter ... what really matters is X" 
  4. "we can't possibly show that to end users without causing a thermonuclear meltdown"
  5. <heads firmly implanted in sand>
People are pretty tied to their preconceived notions, and while I've had some very receptive and faithful allies across our business units, there are a lot more people who'd rather I just go away.  If my data doesn't support their line of thinking, they'll do what they can to squash or ignore it.

I doubt I'm alone here.  

My guess is that while businesses want to be results driven, it sounds better in a PowerPoint deck than it plays out in reality.  They need to be a bit more open to what they might find, even if it's not good.  I once got an "A" on a 20-page math paper in college for proving that I was an idiot.  

The future will not be won by the blissfully ignorant, but by those willing to learn their way forward. 

Businesses successfully leveraging Big Data will follow an iterative, lean, and inquisitive mindset.  Twitter started out as a side project at a podcasting company, Flickr as a game, Nokia made paper and galoshes, and IBM actually made stuff.  Call it a pivot, iteration, rehash ... whatever you like.  

I call it winning.  

Big Data can help a business know when and how to pivot, giving insight into black holes and otherwise unverifiable ideas.  It won't solve all our problems, but let's not dismiss what makes us uncomfortable.




Monday, June 11, 2012

digging for gold

I'm pretty new to the world of textual analysis, but the problem is pretty typical of what I've seen in Big Data.

You typically break the process into three steps:
  1. getting and cleaning the data
  2. dicing it up and putting it back together in interesting ways
  3. presenting it
While #2, the big data bit, is the sexiest and most fun ... you end up spending most of your time on #1 and #3 (with the added bonus of eventually having to automate the whole process).  Data is hard to get, it's ugly, and figuring out how to present it to end users is a bear.  You need to get data moving through the system and out to presentation.

Recently I uncovered a treasure trove of discussions, literally all the discussions for about 50K university students.  Pretty much all of their day-to-day coursework.  Cool, eh?

Well, I first had to beg the vendor to start giving me the data.  Then when I got it, it was full of all sorts of garbage from the DW reporting tool (MS characters, poorly formed course names, etc.).  So I rehashed a pile of Ruby SFTP jobs and bizarre shell scripts (sed, awk, tr, expect, etc.) to get the data into a usable form.
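
A rough flavor of that cleanup pass, with placeholder file names and a guessed-at Windows-1252 source encoding rather than the vendor's actual format:

  #!/usr/bin/env bash
  # Minimal sketch of the first cleanup pass.  File names and the
  # Windows-1252 source encoding are placeholders, not the vendor's
  # actual export format.
  set -euo pipefail

  RAW=discussions_raw.txt       # daily export from the DW reporting tool
  CLEAN=discussions_clean.txt

  # Re-encode the "MS characters" as UTF-8, drop carriage returns,
  # and delete whitespace-only lines.
  iconv -f WINDOWS-1252 -t UTF-8 "$RAW" \
    | tr -d '\r' \
    | sed '/^[[:space:]]*$/d' \
    > "$CLEAN"

The real scripts were uglier, but that's the general shape: pull, normalize, hand off.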

Then I beg the vendor for key mappings so I can get from this raw data back to mine (student and course IDs in this case ... in Hive).  We decide to append this info to a current report (after we duplicate it).  All is well.  But, no.  Turns out that isn't going to map all the users, specifically ones who haven't been logged on for five minutes or more.  The vendor goes dark for a while despite my pleading.  Some more begging turns up a solution, in a report we are already running (for something else).  I retool all my scripts to use the other report, and nothing maps.  I bang my head against the desk repeatedly while checking each step along the way.  Cleaning, parsing, reformatting, pushing, SQL?  Still nothing.  Finally I cave and go back to begging the vendor.  

We set up a meeting, etc.  They think I'm a moron (which at this point might well be true).

Meanwhile I am still hacking the data.  I document a pretty good example and send it on.  Oh wait, the vendor forgot to add a step when he automated the report.  It worked for 05/01 but no other dates.  They rerun the report for the last few days, and it maps.  Great.  I get going again and notice all sorts of duplicates, the same entries showing up in different days' reports.  I ask if this is expected and get a bizarre yes as a reply.

Thankfully I hadn't stopped hacking the data into shape while waiting for a reply.  I carried on knowing the data was wrong (duplicated) and that I could fix that specific problem later (by removing the dups myself).
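
Removing the dups is the easy part.  A minimal sketch, assuming the cleaned daily reports sit one file per day in a single directory (the names are made up):

  #!/usr/bin/env bash
  # Rough dedup sketch: the same entries show up across different
  # days' reports, so merge them and keep one copy of each line
  # before loading.  Directory and file names are made up.
  set -euo pipefail

  REPORT_DIR=reports/cleaned        # one cleaned file per daily report
  MERGED=discussions_deduped.txt

  cat "$REPORT_DIR"/*.txt | sort -u > "$MERGED"

  # Sanity check: how many rows did the duplicates account for?
  echo "rows before: $(cat "$REPORT_DIR"/*.txt | wc -l)"
  echo "rows after:  $(wc -l < "$MERGED")"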

Don't wait; you probably don't know how to interrogate the data yet anyhow.  Keep pushing.

The key is to get through the up-front pain and suffering of getting data flowing through the system.  Once the data is flowing, I can continually dig for gold.  That is the beauty of NoSQL: we don't necessarily have to know what we are looking for when we start collecting data.

The real findings need to be teased out, over time.  Get the data in there, start piling it high, and THEN dig for gold.
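
In Hive terms, one way that plays out is to lay an external table over a directory of raw-ish files in HDFS and just keep appending to it.  A sketch, with made-up paths, table, and column names:

  #!/usr/bin/env bash
  # Sketch of "pile it high, then dig": lay a Hive external table over
  # a directory of raw-ish files and keep appending to it.  Paths,
  # table, and column names are illustrative, not the real schema.
  set -euo pipefail

  # Schema-on-read: the table is just a view over whatever files land
  # in the directory, so each daily load is a simple put.
  hive -e "
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_discussions (
      student_id STRING,
      course_id  STRING,
      posted_at  STRING,
      body       STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/discussions';
  "

  hadoop fs -put discussions_deduped.txt /data/raw/discussions/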



Wednesday, May 30, 2012

Big Data ... say what?

Everyone is talking about Big Data, which probably has 10 different definitions depending on who you are talking to, but no one is blogging.  Seems odd.

My goal is to have an open and honest look at what Big Data means and can offer, with a focus on who is doing it right.  My gut tells me that the promise of Big Data is more transformative than incremental, and that, of the solutions and use cases being peddled, very few lean in that direction.

Let's start with a few of the buzzwords frequently thrown around with Big Data:

  • data warehousing
  • business intelligence
  • KPI
  • structured/unstructured data (NoSQL)
  • data visualization 
  • machine learning
I've gone from least to most interesting here.  The reason I say that is that even though Big Data is already impacting, and will continue to impact, the DW and BI worlds, those are pretty limited things to do.  You are generally only going to see incremental gains ... and costly ones at that.  

That doesn't mean that DWs aren't going to need to reorganize in an age where we don't really ever have to throw away data, or that BI tools aren't going to need the ability to access Hadoop clusters for dealing with larger data sets.  They are.  

But there are fundamental problems with traditional BI: it is resource intensive and slow as a feedback loop.

Let's say we want to do a typical BI project.  We convene a group of product, technical, and BI folks to talk over an idea.  Say we want to increase the number of connections someone on a network has.  The product folks tell us that increasing connections will increase all things good (usage, revenue, etc.) ... basically improve all of our KPIs.  

We then ask the BI/tech folks how we might do this, or feed them some ideas to investigate.  They come back with ideas that are fed into models.  These models are tested/trained and, if they seem good, launched.  We go back, unleash them on the world, and see what happens.  Wash, rinse, repeat.

Let's be charitable and say these iterations take months.  Now, let's look at what LinkedIn did.  

They exposed the data they had to end users and let them do the work.  It's called "People You May Know".  FB does something similar.  More people need to be working on gathering and exposing data to end users, who can do the required BI way better than any crack team of technologists.

In the end, it is less about the amount of data we have access to than getting that little nugget of information to the right person at the right time.