Thursday, November 21, 2013

tf-idf in hive, with lessons

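Tf-idf (term frequency-inverse document frequency) is a great way to measure relevancy: a term scores higher the more often it appears in a document and the fewer documents it appears in overall.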

In my recent work, I am concerned with surfacing skills from job descriptions.  First, I n-gram all the JDs (1.25B rows).  Next, I match them against a known list of skills (10K or so).  Finally, I count them up and create a tf-idf relevancy score.
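As a rough illustration of those three steps on toy data, here is a minimal Python sketch. It is not the Hive job itself: the helper names (`ngrams`, `match_skills`), the whitespace tokenizer, and the sample skill list are all assumptions made for the example.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-word grams of text, lowercased and split on whitespace (an assumed tokenizer)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def match_skills(description, skills, max_n=3):
    """Count how often each known skill appears as a 1-, 2-, or 3-word gram."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for gram in ngrams(description, n):
            if gram in skills:
                counts[gram] += 1
    return counts

# Hypothetical skill list and job description, for illustration only.
skills = {("python",), ("machine", "learning"), ("natural", "language", "processing")}
jd = "Python developer with machine learning and natural language processing experience in Python"

matches = match_skills(jd, skills)
for gram, count in sorted(matches.items()):
    print(" ".join(gram), count)
```

The per-skill counts play the role of the term frequency; the document frequency comes from counting how many job descriptions each skill matches in.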

It took a few tries to get this going at scale in Hive.  I was running into an issue where I'd get all the way to 100% mapped, and then get stuck at 70% reduced.  A candidate I was interviewing let me know it was likely running out of disk space while writing the result set (the EMR logs weren't very helpful).  

Being a lazy developer, I tried larger node footprints (and even partitions) but was still getting nowhere fast.

So I decided to split up my hive query into a number of steps.  The keys to scaling this were pulling the seemingly innocuous join to the whole jobs table for a simple count (30M rows) out into its own step, and creating a temp table with everything but that count before doing the tf-idf calc.

I am now processing the following every morning complete with a tf-idf score.  I hope this is helpful!

jobs: 45,694,724
active jobs: 1,619,206
----------
1w active job/skills matches: 22,932,796 avg: 14.49
2w active job/skills matches: 8,806,166 avg: 6.62
3w active job/skills matches: 506,809 avg: 1.28

total job ngrams reviewed: 1,272,560,565
----------
total 1w ngrams reviewed: 297,779,164
total 2w ngrams reviewed: 475,109,902
total 3w ngrams reviewed: 499,671,499


# get the number of jobs
hive -e 'drop table job_count;'
hive -e 'create table job_count(cnt int);
INSERT OVERWRITE TABLE job_count
select count(distinct hash) as cnt FROM todays_jobs';

# coalesce by 1/2/3w n-gram matches
hive -e 'drop table all_job_skills_match;'
hive -e 'create table all_job_skills_match (hash string, company string, city string, title string, lay_skill string, ngram array<string>,  estfrequency double);
INSERT OVERWRITE TABLE all_job_skills_match

SELECT * from (

SELECT * from (
select * from 1w_job_skills_match UNION ALL
select * from 2w_job_skills_match
) XXX

UNION ALL

select * from 3w_job_skills_match) YYY

;'

# count how many job descriptions each skill shows up in
hive -e 'drop table job_skills_count'
hive -e 'create table job_skills_count(ngram array<string>, count int);
INSERT OVERWRITE TABLE job_skills_count
select ngram, count(ajsm.hash) from all_job_skills_match ajsm group by ngram;'

# create a table with all job stats, but NOT joined to the total number of jobs
hive -e 'drop table todays_job_stats';
hive -e 'create table todays_job_stats (hash string, company string, city string, title string, lay_skill string, ngram array<string>, estfrequency double, job_matches int);
INSERT OVERWRITE TABLE todays_job_stats
select jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count
FROM job_skills_count jsc 
JOIN all_job_skills_match jsm
ON (jsc.ngram = jsm.ngram)
GROUP BY jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count '

# finally run the tf-idf calc by joining to total number of jobs
hive -e 'select hash, company, city, title, lay_skill, ngram, estfrequency, round((estfrequency * log10(cnt/job_matches)),2) from todays_job_stats JOIN job_count' > output.txt
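To make the scoring expression concrete, here is a small Python sketch of the same arithmetic, `estfrequency * log10(cnt / job_matches)` rounded to two places. The function name `tf_idf` and the input numbers are hypothetical and are not taken from the real tables.

```python
import math

def tf_idf(est_frequency, total_jobs, job_matches):
    """Term frequency times log10(total jobs / jobs containing the skill), rounded to 2 places."""
    return round(est_frequency * math.log10(total_jobs / job_matches), 2)

# Hypothetical numbers: 1,000,000 jobs in total, skill mentioned 3 times in one job.
common = tf_idf(est_frequency=3, total_jobs=1_000_000, job_matches=100_000)  # skill in 10% of jobs
rare = tf_idf(est_frequency=3, total_jobs=1_000_000, job_matches=1_000)      # skill in 0.1% of jobs

print(common, rare)
```

With the same in-document frequency, the skill that appears in fewer job descriptions gets the higher score, which is what makes the measure useful for surfacing distinctive skills.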

11 comments:

  1. Hi, great post !
    Any idea where can I find an up to date list of skills ?
    Or how to detect them from corpus of job descriptions ?
    Thanks !
