Thursday, November 21, 2013

tf-idf in hive, with lessons

Tf-idf is a great way to measure relevancy.  

In my recent work, I am concerned with surfacing skills from job descriptions.  First, I n-gram all the JDs (1.25B rows).  Next, I match them against a known list of skills (10K or so).  Finally, I count them up and create a tf-idf relevancy score.  
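To make the matching step concrete, here's a minimal Python sketch of the same idea — n-gram the text, keep only n-grams that appear in the skill list, and count them. The tokenizer, skill list, and JD here are made-up stand-ins, not the real Hive job:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-word ngrams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stand-ins for the real skill list (~10K entries) and a JD.
skills = {"machine learning", "hive", "java"}
jd = "senior java engineer with hive and machine learning experience"

tokens = jd.split()
# 1/2/3-word ngrams, filtered against the known skill list and counted.
matches = Counter(
    g
    for n in (1, 2, 3)
    for g in ngrams(tokens, n)
    if g in skills
)
print(matches)
```

The real pipeline does this per job description hash, so the counts become per-document term frequencies.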

It took a few tries to get this going at scale in Hive.  I was running into an issue where I'd get all the way to 100% mapped, and then get stuck at 70% reduced.  A candidate I was interviewing let me know it was likely running out of disk space while writing the result set (the EMR logs weren't very helpful).  

Being a lazy developer, I tried larger node footprints (and even partitions) but was still getting nowhere fast.

So I decided to split my Hive query into a number of steps.  The keys to scaling this turned out to be isolating the seemingly innocuous join to the whole jobs table (30M rows) for a simple count, and creating a temp table with everything else before doing the tf-idf calculation.

I now process the following every morning, complete with tf-idf scores.  I hope this is helpful!

jobs: 45,694,724
active jobs: 1,619,206
----------
1w active job/skills matches: 22,932,796 avg: 14.49
2w active job/skills matches: 8,806,166 avg: 6.62
3w active job/skills matches: 506,809 avg: 1.28

total job ngrams reviewed: 1,272,560,565
----------
total 1w ngrams reviewed: 297,779,164
total 2w ngrams reviewed: 475,109,902
total 3w ngrams reviewed: 499,671,499
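The relevancy score computed at the end of the pipeline is the classic tf-idf form: term frequency times the log of (total documents / documents containing the term), matching the `round((estfrequency * log10(cnt/job_matches)),2)` expression in the final query below. A minimal Python sketch of that calculation, with made-up numbers:

```python
from math import log10

def tfidf(estfrequency, total_jobs, job_matches):
    """tf-idf as in the final Hive query:
    term frequency * log10(total docs / docs containing the term)."""
    return round(estfrequency * log10(total_jobs / job_matches), 2)

# Hypothetical skill: appears 3 times in one JD,
# and in 10,000 of 1,000,000 jobs overall.
score = tfidf(3, 1_000_000, 10_000)
print(score)  # 3 * log10(100) = 6.0
```

A skill that appears in every job scores 0 (log10 of 1), while a rare skill mentioned often in one JD scores high — which is exactly the "relevancy" behavior wanted here.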


# get the number of jobs
hive -e 'drop table job_count;'
hive -e 'create table job_count(cnt int);
INSERT OVERWRITE TABLE job_count
select count(distinct hash) as cnt FROM todays_jobs;'

# coalesce by 1/2/3w n-gram matches
hive -e 'drop table all_job_skills_match;'
hive -e 'create table all_job_skills_match (hash string, company string, city string, title string, lay_skill string, ngram array<string>,  estfrequency double);
INSERT OVERWRITE TABLE all_job_skills_match

SELECT * from (
select * from 1w_job_skills_match
UNION ALL
select * from 2w_job_skills_match
UNION ALL
select * from 3w_job_skills_match
) unioned;'

# count how many job descriptions each skill shows up in
hive -e 'drop table job_skills_count;'
hive -e 'create table job_skills_count(ngram array<string>, count int);
INSERT OVERWRITE TABLE job_skills_count
select ngram, count(ajsm.hash) from all_job_skills_match ajsm group by ngram;'

# create a table with all job stats, but NOT joined to the total number of jobs
# (the GROUP BY over every selected column just dedupes rows, like DISTINCT)
hive -e 'drop table todays_job_stats;'
hive -e 'create table todays_job_stats (hash string, company string, city string, title string, lay_skill string, ngram array<string>, estfrequency int, job_matches int);
INSERT OVERWRITE TABLE todays_job_stats
select jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count
FROM job_skills_count jsc 
JOIN all_job_skills_match jsm
ON (jsc.ngram = jsm.ngram)
GROUP BY jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count '

# finally run the tf-idf calc by joining to total number of jobs
hive -e 'select hash, company, city, title, lay_skill, ngram, estfrequency, round((estfrequency * log10(cnt/job_matches)),2) from todays_job_stats JOIN job_count' > output.txt