Wednesday, August 22, 2012

trending topics in Hive

<WARNING> I normally try to keep the Big Data discussions in this blog accessible to non-geeks; this post is anything but.  There is a lot of nasty HQL, terminology, etc.  I promise the next post will get back on track. </WARNING>

So I recently unearthed a bunch of discussions and stuck them in my Hive cluster (I'm adding about 4K a day).  I loved the n-gram examples in Hadoop: The Definitive Guide (which I loaned out and never got back!) and in the Amazon EMR docs.  Now I had a nice data set to work with!

Unfortunately, all the Hive examples I've found either leveraged a publicly available n-gram set or didn't really do anything with it.  Getting the n-grams from my Hive table is easy enough with a lifted Hive command and my table structure (table: discussions, column: description):
hive> SELECT explode(context_ngrams(sentences(lower(description)), array(null), 25)) AS word_map FROM discussions;
But the results come back in a hideously ugly structure, full of words that don't tell us much.  I had some work to do ...
{"ngram":["the"],"estfrequency":82998.0}
{"ngram":["to"],"estfrequency":53626.0}
{"ngram":["and"],"estfrequency":51487.0}
{"ngram":["of"],"estfrequency":41579.0}
{"ngram":["a"],"estfrequency":37644.0}
{"ngram":["in"],"estfrequency":29572.0}
{"ngram":["i"],"estfrequency":27989.0}
{"ngram":["is"],"estfrequency":26268.0}
{"ngram":["that"],"estfrequency":23225.0}
{"ngram":["for"],"estfrequency":17557.0}
{"ngram":["be"],"estfrequency":14100.0}
{"ngram":["are"],"estfrequency":12911.0}
{"ngram":["with"],"estfrequency":12660.0}
{"ngram":["it"],"estfrequency":12618.0}
{"ngram":["this"],"estfrequency":12091.0}
{"ngram":["as"],"estfrequency":12023.0}
{"ngram":["have"],"estfrequency":11179.0}
{"ngram":["my"],"estfrequency":10446.0}
{"ngram":["or"],"estfrequency":10011.0}
{"ngram":["on"],"estfrequency":9602.0}
{"ngram":["you"],"estfrequency":9460.0}
{"ngram":["not"],"estfrequency":8521.0}
{"ngram":["they"],"estfrequency":7306.0}
{"ngram":["would"],"estfrequency":7093.0}
{"ngram":["can"],"estfrequency":7087.0}
It was clear I needed a stop-word list to filter out the noise (83K "the"s don't tell me much), so I borrowed this one and put it into a new Hive table.

hadoop fs -mkdir /temp8/
hadoop fs -cp s3://XXX/stopwords.txt /temp8/.
hive -e  'drop table stop_words '
hive -e  'create table stop_words (word string) row format delimited fields terminated by "\t";
load data inpath "/temp8/" overwrite into table stop_words;'
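A quick sanity check (just a count against the table defined above) confirms the load worked:

hive -e 'select count(*) from stop_words;'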
Now I needed to get the data into a structure I could use.  After playing around, I was able to get the returned struct into a Hive table.  This was important because there are a lot of limitations on what you can do with Hive UDTFs (explode, etc.), plus it takes a minute or so to run the n-grams query.  Trust me, it sucks.

hive -e  'drop table trending_words_struct;'
hive -e  'create table trending_words_struct (NEW_ITEM ARRAY<STRUCT<ngram:array<string>, estfrequency:double>>);
INSERT OVERWRITE TABLE trending_words_struct
SELECT context_ngrams(sentences(lower(description)), array(null), 1000) as word_map FROM discussions;'
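One nice side effect of caching the struct in a table: quick checks run in seconds instead of re-running the n-grams job.  For example (purely a sanity check), size() should come back with up to the 1000 entries requested above:

hive -e 'select size(new_item) from trending_words_struct;'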


Next, I can unpack this mess of a structure into something easier to query.

hive -e  'drop table trending_words;'
hive -e  'create table trending_words (ngram string, estfrequency double);'
hive -e 'INSERT OVERWRITE TABLE trending_words select X.ngram[0], X.estfrequency from (select explode(new_item) as X from trending_words_struct) Z;'
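A quick peek (just a spot check) shows the flattened table behaves like any other:

hive -e 'select ngram, estfrequency from trending_words order by estfrequency desc limit 10;'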

Finally, I can filter against the stop-word list:
select tw.ngram, tw.estfrequency from trending_words tw LEFT OUTER JOIN stop_words sw ON (tw.ngram = sw.word) WHERE sw.word is NULL order by tw.estfrequency DESC;
 ... and get a list of non-trivial words from my discussions!  Granted, I still need to add stuff like can/will/etc. to the stop-word list (a sketch for that follows the results below), switch to running this daily, etc.  But you get the idea.

can 5096.0
will 4145.0
one 3018.0
also 2953.0
may 2397.0
children 2187.0
think 2088.0
people 2025.0
time 2007.0
use 1942.0
help 1912.0
behavior 1838.0
information 1755.0
research 1748.0
like 1673.0
work 1654.0
many 1650.0
http 1590.0
child 1552.0
make 1542.0
health 1538.0
2010 1518.0
well 1482.0
new 1481.0
used 1480.0
good 1439.0
different 1432.0
2012 1410.0
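To fold words like can/will into the stop-word list without rebuilding everything, one option (a sketch; extra_stopwords.txt is a hypothetical file with one word per line) is to append a second file to the same table by leaving off the OVERWRITE:

hive -e 'load data local inpath "extra_stopwords.txt" into table stop_words;'

Re-running the join query above then picks up the additions with no other changes.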
The key technical breakthroughs were:
#1 Deciding to just dump the results in their nasty structure into a Hive table
ARRAY<STRUCT<ngram:array<string>, estfrequency:double>> 
#2 Figuring out how to get those results into a usable table (that Z was a killer; a LATERAL VIEW alternative is sketched below)
select X.ngram[0], X.estfrequency from (select explode(new_item) as X from trending_words_struct) Z;
#3 Getting the outer JOIN right
select tw.ngram, tw.estfrequency from trending_words tw LEFT OUTER JOIN stop_words sw ON (tw.ngram = sw.word) WHERE sw.word is NULL order by tw.estfrequency DESC; 
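For what it's worth, Hive's LATERAL VIEW syntax is another way around the single-UDTF restriction that bit me in #2; if I have the syntax right, this should produce the same rows without the subquery alias:

select X.ngram[0], X.estfrequency from trending_words_struct lateral view explode(new_item) t as X;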

Sadly, there was a lot of trial and error.  Dead ends were frequent, a Hive bug was encountered, etc.  It took me days to figure this out, so I wanted to document the process in the hope that others can benefit from my mistakes.

The cool thing is that this treatment will work for just about any text you can get into Hive.  Have at it!
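For instance (an untested sketch against the same discussions table), swapping in Hive's ngrams() UDAF, the sibling of context_ngrams(), should surface the top bigrams instead of single words:

hive> SELECT explode(ngrams(sentences(lower(description)), 2, 25)) AS word_map FROM discussions;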


Wednesday, August 8, 2012

the death of BI

Business Intelligence (BI) processes have long been tied to Data Warehouses (DW):
  1. collect a bunch of data
  2. query it for insight
  3. hypothesize
  4. validate 
  5. test in the wild
  6. goto #1
Where BI processes fail is speed.  This six-step loop takes months in most environments and requires lots of human intervention.  I'm not kidding.  Let's ground this in the real world for a minute.

Say a business partner (BP) has a hunch that students who read the syllabus do better in a class.  They come to the BI team, which tries to see if there is data in the warehouse on syllabus reading.  If not, they have to get it added, which is invariably a giant pain in the ass.  Next they need to map syllabus reading to outcomes (grades, completion, etc.).  More pain and suffering.  Models are made, opinions formed, etc.

Now we have to do something about it.  

Say we make it a requirement, or slap the student in the face with syllabus usage stats.  A deck is made with the findings supporting the thesis and the business case for acting.  The deck is sold and prioritized in the development queue.  The change then needs to be pushed live (in some capacity), tested, and so on.

The problem is that the students who could have been helped are already gone: dropped out, failed, etc.  We "might" be able to help the next set of students, but not the ones we lost.

Big Data solutions give us a near real time feedback loop by opening up access to analysis.

If we are already collecting and saving everything, there is no need to "get stuff added to the DW".  Win #1.  Since the data is stored in an unstructured format, there is no need to do any transformations either.  Win #2.  I might need to bust out some SQL-fu, but that is all; I have everything I need at my disposal.  Now here is the kicker: that giant bastion of knowledge is not locked up in some deep-storage DW with limited access (aka one way in, one way out).  It is in a live store that I can query, process against, and automate.  Win #3.

So now I can put the onus on the end user and have them do the work for me: simply query the store for statistics on syllabus reading (for a particular class, term, etc.) and present them directly to the student.

"Students who read this IT193 syllabus <LINK> on average earn a .56 grade higher".

The feats of strength required to get this sort of thing done in a traditional BI/DW environment are Herculean.  Those environments are not made for real-time processing and presentation; they are made for spitting out reports and safeguarding data.

Now, I'm not suggesting people go burn down the DWs and fire the BI folks.  This will be a gradual process, with Big Data solutions steadily automating away the need for traditional BI.