Thursday, December 20, 2012

using MapReduce to move files

One of the biggest issues in dealing with large volumes of data is moving it around.  Thankfully, we use Amazon's S3 as a store and process via their EMR.  Access inside their network is pretty snappy.

Still, I am moving a lot of data into our Hadoop clusters every morning.  This data needs to go through our master node and then be distributed out to all of the slave nodes (normally 7).  The following copy took a hair over 20 minutes.
hadoop fs -cp s3://eCollegeGradeData/activity/*.CSV /temp4/. 

12/12/20 17:12:50 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_20120127.CSV' for reading
12/12/20 17:12:51 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_20120128.CSV' for reading
12/12/20 17:12:51 INFO s3native.NativeS3FileSystem: Opening
...
12/12/20 17:32:22 INFO s3native.NativeS3FileSystem: Opening 's3://eCollegeGradeData/activity/Activity_KU_Historical_Extract_20120126_7.CSV' for reading

S3DistCp is an S3-optimized version of Apache Hadoop's DistCp that uses MapReduce (well, just the map phase) to move the files directly from S3 to the Hadoop slave nodes, eliminating the master node bottleneck.
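
For reference, the copy below invokes the stock DistCp against an s3n:// source; S3DistCp itself ships as a jar on the EMR nodes and is run in roughly this form (the jar path and flags here are an assumption from the EMR docs of the time, so check your AMI before trusting it):
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar --src s3://eCollegeGradeData/activity/ --dest hdfs:///temp4/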

hadoop distcp s3n://eCollegeGradeData/activity/ /temp4/
12/12/20 17:41:49 INFO tools.DistCp: srcPaths=[s3n://eCollegeGradeData/activity]
12/12/20 17:41:49 INFO tools.DistCp: destPath=/temp4
12/12/20 17:41:50 INFO metrics.MetricsSaver: MetricsSaver DistCp root:hdfs:///mnt/var/lib/hadoop/metrics/ period:60 instanceId:i-b0bbf4ce jobflow:j-3IZ0J8F1W334P
12/12/20 17:41:50 INFO metrics.MetricsUtil: supported product mapr-m3
12/12/20 17:41:50 INFO metrics.MetricsSaver: Disable MetricsSaver due to MapR cluster
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO tools.DistCp: sourcePathsCount=337
12/12/20 17:41:51 INFO tools.DistCp: filesToCopyCount=336
12/12/20 17:41:51 INFO tools.DistCp: bytesToCopyCount=13.0g
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-80-63-76.ec2.internal/10.80.63.76:9001
12/12/20 17:41:51 INFO mapred.JobClient: Running job: job_201212191754_0027
12/12/20 17:41:52 INFO mapred.JobClient:  map 0% reduce 0%
12/12/20 17:42:06 INFO mapred.JobClient:  map 2% reduce 0%
12/12/20 17:42:07 INFO mapred.JobClient:  map 5% reduce 0%
12/12/20 17:42:08 INFO mapred.JobClient:  map 8% reduce 0%
12/12/20 17:42:09 INFO mapred.JobClient:  map 9% reduce 0%
12/12/20 17:42:10 INFO mapred.JobClient:  map 14% reduce 0%
12/12/20 17:42:11 INFO mapred.JobClient:  map 16% reduce 0%
12/12/20 17:42:12 INFO mapred.JobClient:  map 29% reduce 0%
12/12/20 17:42:13 INFO mapred.JobClient:  map 31% reduce 0%
12/12/20 17:42:14 INFO mapred.JobClient:  map 39% reduce 0%
12/12/20 17:42:15 INFO mapred.JobClient:  map 41% reduce 0%
12/12/20 17:42:16 INFO mapred.JobClient:  map 43% reduce 0%
12/12/20 17:42:17 INFO mapred.JobClient:  map 45% reduce 0%
12/12/20 17:42:18 INFO mapred.JobClient:  map 47% reduce 0%
12/12/20 17:42:19 INFO mapred.JobClient:  map 50% reduce 0%
12/12/20 17:42:20 INFO mapred.JobClient:  map 51% reduce 0%
12/12/20 17:42:21 INFO mapred.JobClient:  map 53% reduce 0%
12/12/20 17:42:22 INFO mapred.JobClient:  map 54% reduce 0%
12/12/20 17:42:23 INFO mapred.JobClient:  map 56% reduce 0%
12/12/20 17:42:24 INFO mapred.JobClient:  map 57% reduce 0%
12/12/20 17:42:25 INFO mapred.JobClient:  map 58% reduce 0%
12/12/20 17:42:27 INFO mapred.JobClient:  map 60% reduce 0%
12/12/20 17:42:30 INFO mapred.JobClient:  map 65% reduce 0%
12/12/20 17:42:32 INFO mapred.JobClient:  map 66% reduce 0%
12/12/20 17:42:33 INFO mapred.JobClient:  map 69% reduce 0%
12/12/20 17:42:35 INFO mapred.JobClient:  map 70% reduce 0%
12/12/20 17:42:36 INFO mapred.JobClient:  map 71% reduce 0%
12/12/20 17:42:37 INFO mapred.JobClient:  map 73% reduce 0%
12/12/20 17:42:38 INFO mapred.JobClient:  map 74% reduce 0%
12/12/20 17:42:39 INFO mapred.JobClient:  map 75% reduce 0%
12/12/20 17:42:40 INFO mapred.JobClient:  map 76% reduce 0%
12/12/20 17:42:42 INFO mapred.JobClient:  map 77% reduce 0%
12/12/20 17:42:44 INFO mapred.JobClient:  map 78% reduce 0%
12/12/20 17:42:58 INFO mapred.JobClient:  map 79% reduce 0%
12/12/20 17:43:02 INFO mapred.JobClient:  map 83% reduce 0%
12/12/20 17:43:03 INFO mapred.JobClient:  map 86% reduce 0%
12/12/20 17:43:04 INFO mapred.JobClient:  map 87% reduce 0%
12/12/20 17:43:05 INFO mapred.JobClient:  map 88% reduce 0%
12/12/20 17:43:07 INFO mapred.JobClient:  map 90% reduce 0%
12/12/20 17:43:09 INFO mapred.JobClient:  map 92% reduce 0%
12/12/20 17:43:11 INFO mapred.JobClient:  map 93% reduce 0%
12/12/20 17:43:12 INFO mapred.JobClient:  map 94% reduce 0%
12/12/20 17:43:14 INFO mapred.JobClient:  map 97% reduce 0%
12/12/20 17:43:39 INFO mapred.JobClient:  map 98% reduce 0%
12/12/20 17:43:45 INFO mapred.JobClient:  map 99% reduce 0%
12/12/20 17:43:51 INFO mapred.JobClient:  map 100% reduce 0%
12/12/20 17:46:43 INFO mapred.JobClient: Job complete: job_201212191754_0027
12/12/20 17:46:43 INFO mapred.JobClient: Counters: 21
12/12/20 17:46:43 INFO mapred.JobClient:   Job Counters
12/12/20 17:46:43 INFO mapred.JobClient:     Aggregate execution time of mappers(ms)=1924883
12/12/20 17:46:43 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/20 17:46:43 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/12/20 17:46:43 INFO mapred.JobClient:     Launched map tasks=34
12/12/20 17:46:43 INFO mapred.JobClient:     Aggregate execution time of reducers(ms)=0
12/12/20 17:46:43 INFO mapred.JobClient:   FileSystemCounters
12/12/20 17:46:43 INFO mapred.JobClient:     MAPRFS_BYTES_READ=84770
12/12/20 17:46:43 INFO mapred.JobClient:     S3N_BYTES_READ=13971069631
12/12/20 17:46:43 INFO mapred.JobClient:     MAPRFS_BYTES_WRITTEN=13971069631
12/12/20 17:46:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1659360
12/12/20 17:46:43 INFO mapred.JobClient:   distcp
12/12/20 17:46:43 INFO mapred.JobClient:     Files copied=336
12/12/20 17:46:43 INFO mapred.JobClient:     Bytes copied=13971069631
12/12/20 17:46:43 INFO mapred.JobClient:     Bytes expected=13971069631
12/12/20 17:46:43 INFO mapred.JobClient:   Map-Reduce Framework
12/12/20 17:46:43 INFO mapred.JobClient:     Map input records=336
12/12/20 17:46:43 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=4710170624
12/12/20 17:46:43 INFO mapred.JobClient:     Spilled Records=0
12/12/20 17:46:43 INFO mapred.JobClient:     CPU_MILLISECONDS=530380
12/12/20 17:46:43 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=56869019648
12/12/20 17:46:43 INFO mapred.JobClient:     Map input bytes=50426
12/12/20 17:46:43 INFO mapred.JobClient:     Map output records=0
12/12/20 17:46:43 INFO mapred.JobClient:     SPLIT_RAW_BYTES=5100
12/12/20 17:46:43 INFO mapred.JobClient:     GC time elapsed (ms)=3356
12/12/20 17:46:43 INFO metrics.MetricsSaver: Inside MetricsSaver Shutdown Hook

Now, I should have gone looking for a better solution months ago ... but frankly this wasn't that huge of an issue.  I went from 20 minutes to 5 on this particular copy, and I have seen it go even lower than that.  Add up all of my morning copies and that comes to about half an hour shaved off my execution time.

Not too shabby!


Friday, November 16, 2012

Machine Learning - predicting outcomes from activity

So I have been meandering through Andrew Ng's (Stanford) excellent Machine Learning class at Coursera, which I highly recommend.  

My progress in the course has been slowed by the fact that I keep running across things I can use right now.  I'm in the unique position of having an applicable data set as well as real-world use cases.  Once I found Mahout (scalable machine learning libraries that run on top of Hadoop), I was in.

I'm far less interested in the ins and outs of the actual ML algorithms than in what they can do for me.  So, once I had a basic understanding of classification systems, I started hacking.

In my earlier (and kludgier) efforts at ML I had to define success, and then measure current students against whatever benchmark I set (based on historical data).  That meant lots of manual code, repeated interrogation of the data, and scaling issues.

ML algorithms can help with all of these issues.


• Train a model on a data set.
• Test on a subset (if good continue).
• Use the model with live data to make predictions.
• Leverage those predictions to improve outcomes.


I'm currently leaning on logistic regression, but there are other algorithms in the toolshed depending on your problem set.

p(success) = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ... + wn*xn))

Think of my data as a giant matrix (well, at least the numeric fields) that gets plugged into the above equation.  Each line in the matrix is a training example.  The "features" I use are called "predictors" in Mahout parlance, and the value being classified is the "target".

So I took the following snapshot of data from CJ101 classes in 2012, of all student activity until day 14, and appended the outcomes (grade, success).  We call these our training examples (~4200 used here, ~200K in total for 2012).

username,class,term,section,pct,mins,subs,days,grade,success …
...
TheBeaver,CJ101,1201A,14,0.97,12860.97,235,40,A,1 
WardCleaver,CJ101,1202B,05,0.86,2930.28,90,20,F,0
JuneCleaver,CJ101,1202C,16,0.94,6432.99,124,28,C,1
EddieHaskel,CJ101,1201C,03,1.0,7257.34,96,40,A,1
WallyCleaver,CJ101,1202B,18,0.81,29736.2,462,54,A,1
LumpyRutheford,CJ101,1201A,11,0.78,11685.95,308,48,C,1
WhiteyWitney,CJ101,1201A,03,0.96,37276.78,415,63,A,1
LarryMondello,CJ101,1202C,10,0.73,7223.47,150,22,F,0
MissLanders,CJ101,1204A,04,0.98,13717.4,292,57,B,1 
.. 

Here I set "success" to 1 for good student outcomes (A-C) and 0 for bad ones (D and below).  My goal is to train a model that uses my numeric features (pct, mins, days, subs) to predict the "success" value.
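
For what it's worth, appending that success flag to the raw snapshot was a one-liner; here's a rough sketch (the input file name is made up, and it assumes grade is the 9th column like in the sample above):
awk -F, 'BEGIN{OFS=","} NR==1 {print $0, "success"; next} {print $0, ($9 ~ /^[ABC]/ ? 1 : 0)}' CJ101_raw.csv > CJ101_data.csv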

mahout trainlogistic --input CJ101_data.csv --output ./model --target success --categories 2 --predictors pct mins days subs --types numeric

success ~ 0.001*Intercept Term + 0.009*days + -0.001*mins + 0.000*pct + 0.022*subs       
Intercept Term 0.00058                
days 0.00878                
mins -0.00119                 
pct 0.00041                
subs 0.02231

So Mahout picks the weights for my features, which are used in the logistic regression, basically telling me how important each one of these terms is.  Looking at this particular class ... days and subs are much more important than mins or pct.

Now I need to test how good the model is at predicting the success value based on pct, mins, days, and subs.

mahout runlogistic --input CJ101_data.csv --model ./model --auc --confusion

AUC = 0.74
confusion: [[1594.0, 163.0], [16.0, 2.0]]
entropy: [[-0.0, -0.0], [-14.9, -2.6]]
12/11/14 15:32:14 INFO driver.MahoutDriver: Program took 844 ms 
* AUC (area under the ROC curve) measures how well the model separates successful from unsuccessful students; 1.0 is perfect, 0.5 is a coin flip.

I can actually fiddle around and get this much higher (while making the predictions more usable as well), but let's save that for another post.  The long and short of it is that I can predict the outcomes of CJ101 students just 14 days into a 70-day class with an AUC of 0.74.

This information can not only identify struggling students very early in a class, but also be leveraged in automated interventions.
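
As a rough sketch of that last step, I could score a day-14 snapshot of current students against the saved model (the input file name is hypothetical, and I'm assuming runlogistic's --scores option behaves on my Mahout version the way the docs describe):
mahout runlogistic --input CJ101_current_day14.csv --model ./model --scores > day14_scores.txt
Students whose predicted probability of success comes back low are the ones to flag for an intervention.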

Not too shabby at all!






Thursday, September 27, 2012

Data Week - an exercise in humility

So I just spent a few days at Data Week in San Francisco, and was it ever eye opening.

I've always claimed that I have more complexity issues, tying together large data sets, than actual Big Data problems.  I run a smallish 8 node ~50GB EMR cluster at Amazon, processing a few hours daily.  My issues are generally around throughput, getting data in and out of my cluster in a reasonable time (say < 2hrs).

Then I watch the BigQuery guys from Google doing a full table scan on a 500TB table in a minute.  Ouch.

In session after session, numbers in the hundreds of terabytes or even petabytes were being geek-dropped, dwarfing anything I've seen here in our local Chicago Hadoop/Big Data/Machine Learning groups.  There were interesting discussions just about moving that volume of information around.

What I liked most was the interest in extracting value (information) from data, not necessarily in creating the biggest pile.  I think the R folks (from Revolution Analytics) were spot on here.  I need to spend some time there.

It was also nice to see the lack of BI tool vendors pitching how their wares could "integrate" in some lame-ass way with a Hadoop cluster (JDBC anyone???).  This sometimes infects Big Data gatherings and is insufferable.  So happy to avoid it.

A friend in sales recently asked ...

What Do BI Vendors Mean When They Say They Integrate With Hadoop?

My reply was "Bullshit. About 1/3 of what they tell you."





Wednesday, August 22, 2012

trending topics in Hive

<WARNING> I normally try to keep the Big Data discussions in this blog accessible to non-geeks, this is anything but.  There is a lot of nasty HQL, terminology, etc.  I promise the next post will get back on track. </WARNING>

So I recently unearthed a bunch of discussions and stuck them in my Hive cluster (I'm adding about 4K a day).  I loved the n-gram examples in Hadoop: The Definitive Guide (which I loaned out and never got back!) and in the Amazon EMR docs.  Now I had a nice data set to work with!

Unfortunately, all the Hive examples I've found either leveraged a publicly available n-gram set or didn't really do anything with the results.  Getting the n-grams from my Hive table is easy enough with a lifted Hive command and my table structure (table: discussions, column: description):
hive> SELECT explode(context_ngrams(sentences(lower(description)), array(null), 25)) AS word_map FROM discussions;
But unfortunately the results come back in a hideously ugly structure and full of words that don't tell us much.  I had some work to do ...
{"ngram":["the"],"estfrequency":82998.0}
{"ngram":["to"],"estfrequency":53626.0}
{"ngram":["and"],"estfrequency":51487.0}
{"ngram":["of"],"estfrequency":41579.0}
{"ngram":["a"],"estfrequency":37644.0}
{"ngram":["in"],"estfrequency":29572.0}
{"ngram":["i"],"estfrequency":27989.0}
{"ngram":["is"],"estfrequency":26268.0}
{"ngram":["that"],"estfrequency":23225.0}
{"ngram":["for"],"estfrequency":17557.0}
{"ngram":["be"],"estfrequency":14100.0}
{"ngram":["are"],"estfrequency":12911.0}
{"ngram":["with"],"estfrequency":12660.0}
{"ngram":["it"],"estfrequency":12618.0}
{"ngram":["this"],"estfrequency":12091.0}
{"ngram":["as"],"estfrequency":12023.0}
{"ngram":["have"],"estfrequency":11179.0}
{"ngram":["my"],"estfrequency":10446.0}
{"ngram":["or"],"estfrequency":10011.0}
{"ngram":["on"],"estfrequency":9602.0}
{"ngram":["you"],"estfrequency":9460.0}
{"ngram":["not"],"estfrequency":8521.0}
{"ngram":["they"],"estfrequency":7306.0}
{"ngram":["would"],"estfrequency":7093.0}
{"ngram":["can"],"estfrequency":7087.0}
It was clear I needed a stop-word list to filter out the noise (83K "the"s don't tell me much), so I borrowed this one and put it into a new Hive table.

hadoop fs -mkdir /temp8/
hadoop fs -cp s3://XXX/stopwords.txt /temp8/.
hive -e  'drop table stop_words '
hive -e  'create table stop_words (word string) row format delimited fields terminated by "\t";
load data inpath "/temp8/" overwrite into table stop_words;'
Now I needed to get the data into a structure I could use.  After playing around, I was able to get the returned struct into a Hive table.  This was important because there are a lot of limitations on what you can do with Hive UDTFs (explode, etc.), plus it takes a minute or so to run the n-grams query.  Trust me, it sucks.

hive -e  'drop table trending_words_struct;'
hive -e  'create table trending_words_struct (NEW_ITEM ARRAY<STRUCT<ngram:array<string>, estfrequency:double>>);
INSERT OVERWRITE TABLE trending_words_struct
SELECT context_ngrams(sentences(lower(description)), array(null), 1000) as word_map FROM discussions;'


Next, I can unlock this mess of a structure and put it into something easier to query.

hive -e  'drop table trending_words;'
hive -e  'create table trending_words (ngram string, estfrequency double);'
hive -e 'INSERT OVERWRITE TABLE trending_words select X.ngram[0], X.estfrequency from (select explode(new_item) as X from trending_words_struct) Z;'

Finally, I can filter against the stop-word list:
select tw.ngram, tw.estfrequency from trending_words tw LEFT OUTER JOIN stop_words sw ON (tw.ngram = sw.word) WHERE sw.word is NULL order by tw.estfrequency DESC;
 ... And get a list of non-trivial words from my discussions!  Granted, I still need to add stuff like can/will/etc. to the stop-word list, switch this to a daily run, and so on.  But you get the idea.

can 5096.0
will 4145.0
one 3018.0
also 2953.0
may 2397.0
children 2187.0
think 2088.0
people 2025.0
time 2007.0
use 1942.0
help 1912.0
behavior 1838.0
information 1755.0
research 1748.0
like 1673.0
work 1654.0
many 1650.0
http 1590.0
child 1552.0
make 1542.0
health 1538.0
2010 1518.0
well 1482.0
new 1481.0
used 1480.0
good 1439.0
different 1432.0
2012 1410.0
The key technical breakthroughs were:
#1 Deciding to just dump the results in their nasty structure into a Hive table
ARRAY<STRUCT<ngram:array<string>, estfrequency:double>> 
#2 Figuring out how to get those results into a usable table (that Z was a killer!)
select X.ngram[0], X.estfrequency from (select explode(new_item) as X from trending_words_struct) Z;
#3 getting the outer JOIN right
select tw.ngram, tw.estfrequency from trending_words tw LEFT OUTER JOIN stop_words sw ON (tw.ngram = sw.word) WHERE sw.word is NULL order by tw.estfrequency DESC; 
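
And for the "add can/will/etc. and run this daily" bit above, the upkeep is mostly just growing the stop-word list and re-running the join in #3; a minimal sketch, assuming a local copy of stopwords.txt (mine originally came from S3) and a fresh staging directory:
printf "can\nwill\nmay\nalso\n" >> stopwords.txt
hadoop fs -mkdir /temp9/
hadoop fs -put stopwords.txt /temp9/
hive -e 'load data inpath "/temp9/" overwrite into table stop_words;'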

Sadly, there was a lot of trial and error.  Dead ends were frequent, a Hive bug was encountered, etc.  It took me days to get this figured out, and I wanted to document the process in hopes others benefit from my mistakes.

The cool thing is this treatment will work for just about anything.  Have at it!


Wednesday, August 8, 2012

the death of BI

Business Intelligence (BI) processes have long been tied to Data Warehouses (DW).
  1. collect a bunch of data
  2. query it for insight
  3. hypothesize
  4. validate 
  5. test in the wild
  6. goto #1
Where BI processes fail is in speed.  This six-step process takes months in most environments and requires lots of human intervention.  I'm not kidding.  Let's try to real-world this for a minute.

So say a business partner (BP) has an idea that students who read the syllabus do better in a class.  They come to the BI team, which tries to see if there is data in the warehouse on syllabus reading.  If not, they have to get it added, which is invariably a giant pain in the ass.  Next they need to map syllabus reading to outcomes (grades, completion, etc.).  More pain and suffering.  Models are made, opinions formed, etc.

Now we have to do something about it.  

Say we make it a requirement or slap the student in the face with syllabus usage stats.  Then a deck is made with the findings supporting the thesis and business case for acting.  The deck is sold and prioritized in the development queue.  This now needs to be pushed live (in some capacity) and then tested and whatnot.  

The problem is the students who could have been helped by the process are gone, dropped out, failed, etc.  We "might" be able to help the next set of students, but not those who are already gone.  

Big Data solutions give us a near real time feedback loop by opening up access to analysis.

If we are already collecting and saving everything, there is no need to "get stuff added to the DW".  Win #1.  Since data is stored in an unstructured format, there is no need to do any transformations either.  Win #2.  I might need to bust out some SQL-fu, but that is all.  I have everything I need at my disposal.  Now here is the kicker: that giant bastion of knowledge is not locked up in some deep-storage DW with limited access (aka one way in, one way out).  It is in a live store that I have query access to, with the ability to process and automate.  Win #3.

So now,  I can put the onus on the end user and have them do the work for me.  Simply query the store for statistics on syllabus reading (for a particular class, term, etc) and present it directly to the student.  

"Students who read this IT193 syllabus <LINK> on average earn a .56 grade higher".

The feats of strength required to get this sort of thing done in a traditional BI/DW environment are Herculean.  They are not made for real time processing and presentation. They are made for spitting out reports and safeguarding data.
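
For contrast, in the live store a stat like the IT193 one above could come from a single query; here's a hedged sketch, with completely made-up table and column names:
hive -e 'select avg(case when read_syllabus = 1 then grade_points end)
            - avg(case when read_syllabus = 0 then grade_points end)
         from course_outcomes where class = "IT193";'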

Now I'm not suggesting people go burn down the DWs and fire the BI folks.  This will be a gradual process over time, where Big Data solutions automate away the need for BI.  



Wednesday, July 11, 2012

How big is your Hadoop cluster?

For an embarrassingly long time I've been trying to find a nice way to see how much data is in my Hadoop cluster.  You can easily go table by table in the Hive shell, but I have a dozen or so tables, which makes it pretty taxing to answer an easy question.  So it is pretty lame.
hive> SHOW TABLE EXTENDED in default LIKE word_frequency;
Another option is to use a reporting tool, like Ganglia (my tool of choice).  While it shows my daily max/min/average, I don't have a report of what I'm using "at rest".  I eventually found a drop-down that gives actual memory usage (second image), but that was evenly split across my nodes (7.6GBx8 = ~30GB).  Way too regular.



So finally it dawns on me to just ask HDFS, and after screwing around a bit it seems "/" is recursive as well as cumulative.  Nice!  The ~22GB was exactly the number I had roughed out by looking at the individual tables.  Now why didn't I think of that from the start?
$ hadoop fs -du /
Found 1 items
23670012101  hdfs://10.80.225.26:9000/mnt
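
If you want the per-table view in one shot, pointing the same command at the Hive warehouse directory should do it (that path is an assumption based on the default warehouse location, and it may differ on an EMR/MapR cluster):
hadoop fs -du /user/hive/warehouse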
As always, the command line rules the roost.

Friday, June 22, 2012

fear and loathing in the Hadoop Cluster

What I've noticed most on the business end of Big Data is the love/hate dichotomy.

I'm not talking about DW or BI folks who underestimate the promise of never having to throw data away, using commodity hardware, realizing the scaling efficiencies of the cloud, etc.  It will take time, but they are at least coming around.

I'm talking about folks who are actually scared of data.

Now, I'm a pretty open guy.  I actually like getting proven wrong, which happens an astonishing amount of the time.  It won't stop me from having an opinion, arguing for what I think is right, etc.  It is just the Socratic worldview that I employ.  My goal is to get better, learn more ... not be right.

As I've amassed a good amount of our working business data, I talk it up.  A lot.  To anyone who will listen.  What was shocking from my end were the reactions I often get:

  1. "the data is wrong"
  2. "our data is better" (even if it is only a small subset), a cousin of #1
  3. "what you are looking at doesn't matter ... what really matters is X" 
  4. "we can't possibly show that to end users without causing a thermonuclear meltdown"
  5. <heads firmly implanted in sand>
People are pretty tied to their preconceived notions, and while I've had some very receptive and faithful allies across our business units there are a lot more people who'd rather I go away.  If my data doesn't prove their line of thinking, they'll do what they can to squash or ignore it.

I doubt I'm alone here.  

My guess is that while businesses want to be results-driven, it sounds better in a PowerPoint deck than in reality.  They need to be a bit more open to what they might find, even if it's not good.  I once got an "A" on a 20-page math paper in college for proving that I was an idiot.

The future will not be won by the blissfully ignorant, but by those willing to learn their way forward.

Businesses successfully leveraging Big Data will follow an iterative, lean, and inquisitive mindset.  Twitter started out inside a podcasting company, Flickr as a game, Nokia as a maker of paper and galoshes, and IBM actually made stuff.  Call it a pivot, iteration, rehash ... whatever you like.

I call it winning.  

Big Data can help businesses know when and how to pivot, giving insight into black holes and unverifiable ideas.  It won't solve all our problems, but let's not dismiss what makes us uncomfortable.




Monday, June 11, 2012

digging for gold

I'm pretty new to the world of textual analysis, but the problem is pretty typical of what I've seen in Big Data.

You typically break the process into three steps:
  1. getting and cleaning the data
  2. dicing it up and putting it back together in interesting ways
  3. presenting it
While #2, the big data bit, is the most sexy and fun ... you end up spending most of your time on #1 and #3 (with the added bonus of eventually having to automate the process).  Data is hard to get, ugly, and figuring out how to present it to end users is a bear.  You need to get data moving through the system, and out to presentation.

Recently I uncovered a treasure trove of discussions, literally all the discussions for about 50K university students.  Pretty much all of their day to day coursework.  Cool, eh?

Well, I first had to beg the vendor to start giving me the data.  Then when I got it, it was full of all sorts of garbage from the DW reporting tool (MS characters, poorly formed course names, etc.).  So I rehash a bunch of Ruby SFTP and bizarre shell scripts (sed, awk, tr, expect, etc.) to get the data into a usable form.

Then I beg the vendor for key mappings so I can get from this raw data back to mine (student and course IDs in this case ... in Hive).  We decide to append this info to a current report (after we duplicate it).  All is well.  But, no.  Turns out that isn't going to map all the users, specifically ones who haven't been logged on for five minutes or more.  The vendor goes dark for a while despite my pleading.  Some more begging turns up a solution, in a report we are already running (for something else).  I retool all my scripts to use the other report, and nothing maps.  I bang my head against the desk repeatedly while checking each step along the way.  Cleaning, parsing, reformatting, pushing, SQL?  Still nothing.  Finally I cave and go back to begging the vendor.

We set up a meeting, etc.  They think I'm a moron (which at this point might well be true).

Meanwhile I am still hacking the data.  I document a pretty good example and send it on.  Oh wait, the vendor forgot to add a step when he automated the report.  It worked for 05/01 but no other dates.  They rerun the report for the last few days, and it maps.  Great.  I get going again and notice all sorts of duplicates, the same entries on different days' reports.  I ask if this is expected and get a bizarre yes as a reply.

Thankfully I hadn't stopped hacking the data into shape while waiting for a reply.  I carried on knowing the data was wrong (duplicated) and that I could fix that specific problem later (by removing the dups myself).
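
For what it's worth, once the data lands in Hive the dedup itself is the easy part; a rough sketch, with made-up table names:
hive -e 'create table discussions_dedup like discussions_raw;
insert overwrite table discussions_dedup select distinct * from discussions_raw;'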

Don't wait, you probably don't know how to interrogate the data yet anyhow.  Keep pushing.

The key is to push through the up-front pain and suffering of getting data flowing through the system.  Once it is flowing, I can continually dig for gold.  That is the beauty of NoSQL: we don't necessarily have to know what we are looking for when we start collecting data.

The real findings need to be teased out, over time.  Get the data in there, start piling it high, and THEN dig for gold.



Wednesday, May 30, 2012

Big Data ... say what?

Everyone is talking about Big Data, which probably has 10 different definitions depending on who you are talking to, but no one is blogging.  Seems odd.

My goal is to have an open and honest look at what Big Data means and can offer, with a focus on who is doing it right.  My gut tells me that the promise of Big Data is more transformative than incremental, and that very few of the solutions and use cases being peddled lean in that direction.

Let's start with a few of the buzzwords frequently thrown around with Big Data:

  • data warehousing
  • business intelligence
  • KPI
  • structured/unstructured data (NOSQL)
  • data visualization 
  • machine learning
I've gone from least to most interesting here.  The reason I say that is that even though the era of Big Data is impacting and will continue to impact the DW and BI worlds, those are pretty limited and very costly things to do.  You are generally only going to see incremental gains ... and costly ones at that.

That doesn't mean that DWs aren't going to need to reorganize in an age where we don't really ever have to throw away data, or that BI tools aren't going to need the ability to access Hadoop clusters for dealing with larger data sets.  They are.

But there are fundamental problems with traditional BI: it is resource-intensive and slow as a feedback loop.

Let's say we want to do a typical BI project.  We convene a group of product, technical, and BI folks to talk over an idea.  Say we want to increase the number of connections someone on a network has.  The product folks tell us that increasing connections will increase all things good (usage, revenue, etc.) ... basically improve all of our KPIs.

We then ask the BI/tech folks how we might do this, or feed them some ideas to investigate.  They come back with ideas that are fed into models.  These models are tested/trained and, if they seem good, launched.  We go back and unleash them on the world and see what happens.  Wash, rinse, repeat.

Let's be charitable and say these iterations take months.  Now, let's look at what LinkedIn did.

They exposed the data they had to end users and let them do the work.  It's called "People You May Know".  FB does something similar.  More people need to be working on gathering and exposing data to end users, who can do the required BI way better than any crack team of technologists.

In the end, it is less about the amount of data we have access to than about getting that little nugget of information to the right person at the right time.