Monday, June 11, 2012

digging for gold

I'm fairly new to the world of textual analysis, but the problem is pretty typical of what I've seen in Big Data.

You typically break the process into three steps:
  1. getting and cleaning the data
  2. dicing it up and putting it back together in interesting ways
  3. presenting it
While #2, the big data bit, is the sexiest and most fun ... you end up spending most of your time on #1 and #3 (with the added bonus of eventually having to automate the process).  Data is hard to get and ugly, and figuring out how to present it to end users is a bear.  You need to get data moving through the system, and out to presentation.

Recently I uncovered a treasure trove of discussions, literally all the discussions for about 50K university students.  Pretty much all of their day-to-day coursework.  Cool, eh?

Well, first I had to beg the vendor to start giving me the data.  Then when I got it, it was full of all sorts of garbage from the DW reporting tool (MS characters, poorly formed course names, etc.).  So I hacked together a bunch of Ruby SFTP and bizarre shell scripts (sed, awk, tr, expect, etc.) to get the data into a usable form.
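The cleanup pass looks roughly like this.  This is a sketch, not the actual scripts: the file names are made up, and I'm assuming the "MS characters" are the usual suspects (smart quotes and Windows line endings), with glibc iconv doing the transliteration.

```shell
# Fake a vendor export with a Windows CR line ending and smart quotes,
# standing in for what the DW reporting tool actually spits out.
printf 'Prof\xe2\x80\x99s \xe2\x80\x9cIntro\xe2\x80\x9d course\r\n' > raw_report.txt

# Drop carriage returns, then transliterate non-ASCII punctuation down
# to plain ASCII (glibc iconv turns smart quotes into ' and ").
tr -d '\r' < raw_report.txt \
  | iconv -f UTF-8 -t ASCII//TRANSLIT > clean_report.txt
```

The real pipeline chained a lot more of this (sed/awk for the malformed course names), but the shape is the same: small filters, each fixing one kind of garbage, glued together with pipes.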

Then I beg the vendor for key mappings so I can get from this raw data back to mine (student and course IDs in this case ... in Hive).  We decide to append this info to a current report (after we duplicate it).  All is well.  But, no.  Turns out that isn't going to map all the users, specifically ones who haven't been logged on for five minutes or more.  The vendor goes dark for a while despite my pleading.  Some more begging turns up a solution, in a report we are already running (for something else).  I retool all my scripts to use the other report, and nothing maps.  I bang my head against the desk repeatedly while checking each step along the way.  Cleaning, parsing, reformatting, pushing, SQL?  Still nothing.  Finally I cave and go back to begging the vendor.  
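The mapping step itself is just a join on the vendor's ID.  A minimal command-line sketch, with invented file names and a made-up two-column key-map layout (vendor id, our student id) — the real thing happened in Hive, but `join` is handy for eyeballing which users fail to map:

```shell
# Made-up inputs: a key-map report and a discussion report, both TSV,
# both keyed on the vendor's user id in column 1.
printf 'v1\ts100\nv2\ts200\n'                       > key_map.tsv
printf 'v1\tpost about hw1\nv3\tpost about hw2\n'   > vendor_report.tsv

# join requires inputs sorted on the join field.
sort -t$'\t' -k1,1 key_map.tsv       > map.sorted
sort -t$'\t' -k1,1 vendor_report.tsv > report.sorted

# Inner join: rows we can map back to our student IDs.
join -t$'\t' map.sorted report.sorted      > mapped.tsv
# The -v 2 rows are the report entries with no mapping at all --
# exactly the users that kept "not mapping" and needed chasing.
join -t$'\t' -v 2 map.sorted report.sorted > unmapped.tsv
```

Dumping the unmatched side (`-v 2`) is the useful trick here: when "nothing maps," that file tells you whether the problem is your cleaning or the vendor's report.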

We set up a meeting, etc.  They think I'm a moron (which at this point might well be true).

Meanwhile I am still hacking the data.  I document a pretty good example and send it on.  Oh wait, the vendor forgot to add a step when he automated the report.  It worked for 05/01 but no other dates.  They rerun the report for the last few days, and it maps.  Great.  I get going again and notice all sorts of duplicates, the same entries on different days' reports.  I ask if this is expected and get a bizarre "yes" as a reply.

Thankfully I hadn't stopped hacking the data into shape while waiting for a reply.  I carried on knowing the data was wrong (duplicated), and that I could fix that specific problem later (by removing the dups myself).
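Removing the dups is the easy part, which is why it was safe to defer.  A sketch with made-up file names, assuming each day's report is a TSV and the duplicates are exact repeated rows:

```shell
# Two made-up daily reports that share a row, like the vendor's
# "expected" duplicates across days.
printf 'c1\ts100\tpost A\nc1\ts101\tpost B\n' > report_0501.tsv
printf 'c1\ts100\tpost A\nc2\ts102\tpost C\n' > report_0502.tsv

# Collapse exact duplicate rows across all the daily files.
cat report_0501.tsv report_0502.tsv | sort -u > discussions_deduped.tsv
```

In Hive itself the same cleanup is just a `SELECT DISTINCT` into a fresh table, so the duplicates never needed to block the pipeline.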

Don't wait; you probably don't know how to interrogate the data yet anyhow.  Keep pushing.

The key is to get through the up-front pain and suffering of getting data flowing through the system.  Once I have the data flowing, I can continually dig for gold.  That is the beauty of NoSQL.  We don't necessarily have to know what we are looking for when we start collecting data.

The real findings need to be teased out, over time.  Get the data in there, start piling it high, and THEN dig for gold.


