Thursday, November 21, 2013

tf-idf in hive, with lessons

Tf-idf is a great way to measure relevancy.  

In my recent work, I am concerned with surfacing skills from job descriptions.  First, I n-gram all the JDs (1.25B rows).  Next, I match them against a known list of skills (10K or so).  Finally, I count them up and create a tf-idf relevancy score.  
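To make the matching step concrete, here's a minimal Python sketch of the same idea — n-gram the text, keep only n-grams that appear in the skill list, and count them. The tokenizer, skill list, and JD here are made-up stand-ins, not the real Hive job:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-word ngrams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stand-ins for the real skill list (~10K entries) and a JD.
skills = {"machine learning", "hive", "java"}
jd = "senior java engineer with hive and machine learning experience"

tokens = jd.split()
# 1/2/3-word ngrams, filtered against the known skill list and counted.
matches = Counter(
    g
    for n in (1, 2, 3)
    for g in ngrams(tokens, n)
    if g in skills
)
print(matches)
```

The real pipeline does this per job description hash, so the counts become per-document term frequencies.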

It took a few tries to get this going at scale in Hive.  I was running into an issue where I'd get all the way to 100% mapped, and then get stuck at 70% reduced.  A candidate I was interviewing let me know it was likely running out of disk space while writing the result set (the EMR logs weren't very helpful).  

Being a lazy developer, I tried larger node footprints (and even partitions) but was still getting nowhere fast.

So I decided to split my Hive query into a number of steps.  The keys to scaling this turned out to be isolating the seemingly innocuous join to the whole jobs table (30M rows) for a simple count, and creating a temp table with everything else before doing the tf-idf calculation.

I now process the following every morning, complete with tf-idf scores.  I hope this is helpful!

jobs: 45,694,724
active jobs: 1,619,206
----------
1w active job/skills matches: 22,932,796 avg: 14.49
2w active job/skills matches: 8,806,166 avg: 6.62
3w active job/skills matches: 506,809 avg: 1.28

total job ngrams reviewed: 1,272,560,565
----------
total 1w ngrams reviewed: 297,779,164
total 2w ngrams reviewed: 475,109,902
total 3w ngrams reviewed: 499,671,499
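The relevancy score computed at the end of the pipeline is the classic tf-idf form: term frequency times the log of (total documents / documents containing the term), matching the `round((estfrequency * log10(cnt/job_matches)),2)` expression in the final query below. A minimal Python sketch of that calculation, with made-up numbers:

```python
from math import log10

def tfidf(estfrequency, total_jobs, job_matches):
    """tf-idf as in the final Hive query:
    term frequency * log10(total docs / docs containing the term)."""
    return round(estfrequency * log10(total_jobs / job_matches), 2)

# Hypothetical skill: appears 3 times in one JD,
# and in 10,000 of 1,000,000 jobs overall.
score = tfidf(3, 1_000_000, 10_000)
print(score)  # 3 * log10(100) = 6.0
```

A skill that appears in every job scores 0 (log10 of 1), while a rare skill mentioned often in one JD scores high — which is exactly the "relevancy" behavior wanted here.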


# get the number of jobs
hive -e 'drop table job_count;'
hive -e 'create table job_count(cnt int);
INSERT OVERWRITE TABLE job_count
select count(distinct hash) as cnt FROM todays_jobs;'

# coalesce by 1/2/3w n-gram matches
hive -e 'drop table all_job_skills_match;'
hive -e 'create table all_job_skills_match (hash string, company string, city string, title string, lay_skill string, ngram array<string>,  estfrequency double);
INSERT OVERWRITE TABLE all_job_skills_match

SELECT * from (
select * from 1w_job_skills_match
UNION ALL
select * from 2w_job_skills_match
UNION ALL
select * from 3w_job_skills_match
) unioned;'

# count how many job descriptions each skill shows up in
hive -e 'drop table job_skills_count;'
hive -e 'create table job_skills_count(ngram array<string>, count int);
INSERT OVERWRITE TABLE job_skills_count
select ngram, count(ajsm.hash) from all_job_skills_match ajsm group by ngram;'

# create a table with all job stats, but NOT joined to the total number of jobs
# (the GROUP BY over every selected column just dedupes rows, like DISTINCT)
hive -e 'drop table todays_job_stats;'
hive -e 'create table todays_job_stats (hash string, company string, city string, title string, lay_skill string, ngram array<string>, estfrequency int, job_matches int);
INSERT OVERWRITE TABLE todays_job_stats
select jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count
FROM job_skills_count jsc 
JOIN all_job_skills_match jsm
ON (jsc.ngram = jsm.ngram)
GROUP BY jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count '

# finally run the tf-idf calc by joining to total number of jobs
hive -e 'select hash, company, city, title, lay_skill, ngram, estfrequency, round((estfrequency * log10(cnt/job_matches)),2) from todays_job_stats JOIN job_count' > output.txt