Big Data Bloggin: August 2013

I've been doing a lot of NLP (Natural Language Processing) work in Hive of late, and earlier posted about the excellent n-gram utilities available out of the box.

That said, there are some scaling issues.

I was running into problems with the code below: it ran on a few hundred rows, but I was getting heap overflow errors in Hive when I ran it on more than 1000. Now, you could up the heap size, but that isn't really addressing the problem. Plus, I needed it to run on over 100K rows.

# this is fine, pulls in daily jobs, 85K or so for 08/01
hive -e 'drop table todays_jobs;'
hive -e 'create table todays_jobs(id string, title string, description string);
INSERT OVERWRITE TABLE todays_jobs
SELECT id, title, description from jobs
WHERE substring(post_date,2,10)="2013-08-01"
GROUP BY id, title, description;'

## this works on 1000 jd's ... pukes on 10000 ##
hive -e 'drop table job_skills;'
hive -e 'create table job_skills(id string, title string, NEW_ITEM ARRAY<STRUCT<ngram:array<string>, estfrequency:double>>);
INSERT OVERWRITE TABLE job_skills
SELECT id, title, context_ngrams(sentences(lower(description)), array(null), 100) as word_map
FROM todays_jobs
GROUP BY id, title;'

What got me past this was partitioning the table. I found the Hive documentation lacking, so as a public service I am posting my code with some comments. Notice that I added a fourth item in the select clause (substring(title,0,2)), even though I only declare three fields while creating the table. This is where the partition value goes. Normally a date would be great to use here, but in my case they were all the same, so I had to look elsewhere.

# I create partitions based on the first two characters of the title
# this gave me 683 partitions for 08/01 ... which slowed down the MapReduce process
# but allowed me to complete the ngram processing in the next step

hive -e 'drop table todays_jobs;'
hive -e 'create table todays_jobs(id string, description string, title string) PARTITIONED BY (part string);
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=1000;
INSERT OVERWRITE TABLE todays_jobs PARTITION(part)
SELECT id, description, title, substring(title,0,2) from jobs
WHERE substring(post_date,2,10)="2013-08-01"
GROUP BY id, description, title, substring(title,0,2)
;'
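
To picture what the dynamic partition column is doing, here is a rough sketch in plain Python (not Hive). It assumes the partition value is simply the first two characters of the title, as in substring(title,0,2) above, and it reuses a few titles from the sample output in this post with the surrounding quotes stripped. The names partition_key and bucket_rows are made up for illustration.

```python
from collections import defaultdict


def partition_key(title: str) -> str:
    """Mimic substring(title,0,2): the first two characters of the title."""
    return title[:2]


def bucket_rows(rows):
    """Group (id, title) rows by partition key, one bucket per distinct key."""
    buckets = defaultdict(list)
    for job_id, title in rows:
        buckets[partition_key(title)].append(job_id)
    return dict(buckets)


# Ids and titles borrowed from the sample output in this post (quotes stripped).
rows = [
    ("439235503", "Director of Marketing Kindred Central Dakotas"),
    ("439235505", "Certified Nursing Assistant I / Full Time"),
    ("439235506", "CNA - Evenings, Full Time Part Time - Wareham, MA"),
    ("439235507", "CNA - Full Time Part Time, Days - Wareham, MA"),
]

buckets = bucket_rows(rows)
for key, ids in sorted(buckets.items()):
    print(repr(key), ids)
```

Every distinct two-character prefix becomes its own bucket, which is why a full day of titles can fan out into hundreds of partitions.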

# now works on 80K+ rows ... extracting 100 one-word n-grams per row

hive -e 'drop table ngram_skills;'
hive -e 'create table ngram_skills(id string, title string, NEW_ITEM ARRAY<STRUCT<ngram:array<string>, estfrequency:double>>);
INSERT OVERWRITE TABLE ngram_skills
SELECT id, title, context_ngrams(sentences(lower(description)), array(null), 100) as word_map
FROM todays_jobs
group by id, title;'
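
If the context_ngrams call looks opaque, this is roughly the shape of what it produces for each row: the top k single words with a frequency. The Python below is only an approximation under my own assumptions. It uses a simple regex tokenizer instead of Hive's sentences(), it counts exactly where Hive reports an estimated frequency, and the description text and the name top_unigrams are hypothetical.

```python
import re
from collections import Counter


def top_unigrams(description: str, k: int = 100):
    """Return up to k (word, count) pairs, most frequent first."""
    # Lowercase first, as the query does with lower(description).
    words = re.findall(r"[a-z0-9']+", description.lower())
    return Counter(words).most_common(k)


# Hypothetical job description, not taken from the real jobs table.
description = "Marketing lead. Owns marketing plans and sales training."

for word, count in top_unigrams(description, k=3):
    print(word, count)
```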

What I end up with here is a table with job details (id and title) and the n-grams of the description. This makes for very easy matching (once I explode the n-gram array, which is pretty damn cryptic).

hive -e 'drop table trending_words;'
hive -e 'create table trending_words (id string, title string, ngram array<string>, estfrequency double);
INSERT OVERWRITE TABLE trending_words
SELECT id, title, X.ngram, X.estfrequency from ngram_skills LATERAL VIEW explode(new_item) Z as X;'
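
Since the LATERAL VIEW explode syntax is the cryptic part, here is a small Python sketch of the same idea: each nested row fans out into one flat row per n-gram, and a row with an empty array produces nothing. The function name explode_rows is made up, and the sample values are borrowed from the output shown further down.

```python
def explode_rows(table):
    """Flatten (id, title, [(ngram, estfrequency), ...]) into one row per n-gram."""
    flat = []
    for job_id, title, items in table:
        for ngram, estfrequency in items:
            flat.append((job_id, title, ngram, estfrequency))
    return flat


# One nested row shaped like the ngram_skills table.
ngram_skills = [
    (
        "439235503",
        "Director of Marketing Kindred Central Dakotas",
        [(["marketing"], 9.0), (["sales"], 5.0)],
    ),
]

trending_words = explode_rows(ngram_skills)
for row in trending_words:
    print(row)
```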

hive -e 'select tw.id, tw.title, tw.ngram, tw.estfrequency FROM trending_words tw JOIN 1w_taxonomy tax ON (tw.ngram[0] = lower(tax.title)) where tax.title<>"title" order by tw.id, tw.estfrequency DESC limit 1000;'

And I get output like this!!!!

439235503 "Director of Marketing Kindred Central Dakotas" ["marketing"] 9.0
439235503 "Director of Marketing Kindred Central Dakotas" ["sales"] 5.0
439235503 "Director of Marketing Kindred Central Dakotas" ["leadership"] 3.0
439235503 "Director of Marketing Kindred Central Dakotas" ["training"] 2.0

439235505 "Certified Nursing Assistant I / Full Time" ["supervision"] 1.0
439235505 "Certified Nursing Assistant I / Full Time" ["collaboration"] 1.0

439235506 "CNA - Evenings, Full Time Part Time - Wareham, MA" ["collaboration"] 1.0
439235506 "CNA - Evenings, Full Time Part Time - Wareham, MA" ["supervision"] 1.0
439235507 "CNA - Full Time Part Time, Days - Wareham, MA" ["collaboration"] 1.0
439235507 "CNA - Full Time Part Time, Days - Wareham, MA" ["supervision"] 1.0
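
For anyone who wants the final matching step spelled out, the join against the taxonomy can be sketched in Python like this. The taxonomy entries and the "the" row are hypothetical, the function name match_taxonomy is made up, and a set lookup stands in for the Hive join, so duplicate taxonomy titles would not multiply rows here the way a real join would.

```python
def match_taxonomy(trending_words, taxonomy_titles):
    """Keep rows whose first n-gram token matches a lowercased taxonomy title."""
    # Skip the literal header value "title", as the WHERE clause does.
    wanted = {t.lower() for t in taxonomy_titles if t != "title"}
    hits = [row for row in trending_words if row[2][0] in wanted]
    # Order by id, then by estimated frequency, highest first.
    return sorted(hits, key=lambda row: (row[0], -row[3]))


job_title = "Director of Marketing Kindred Central Dakotas"
trending_words = [
    ("439235503", job_title, ["the"], 12.0),  # hypothetical non-skill word
    ("439235503", job_title, ["sales"], 5.0),
    ("439235503", job_title, ["marketing"], 9.0),
]
taxonomy_titles = ["title", "Marketing", "Sales"]  # hypothetical taxonomy rows

matches = match_taxonomy(trending_words, taxonomy_titles)
for row in matches:
    print(row)
```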

Friday, August 30, 2013

Real Time Intelligence talk

Tuesday, August 13, 2013

Scaling Hive Row Specific N-Grams