In my recent work, I have been surfacing skills from job descriptions (JDs). First, I n-gram all the JDs (1.25B rows). Next, I match them against a known list of skills (10K or so). Finally, I count them up and create a tf-idf relevancy score.
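To make the first two steps concrete, here is a minimal Python sketch of n-gramming a description and matching the n-grams against a skill list. It is not the Hive code used in this post: the whitespace tokenizer, the `ngrams` and `match_skills` helpers, and the sample skills and description are all assumptions made up for illustration.

```python
from typing import Iterator, List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def match_skills(description: str,
                 skills: Set[Tuple[str, ...]],
                 max_n: int = 3) -> List[Tuple[str, ...]]:
    """Return the 1- to max_n-word n-grams of a description that are known skills."""
    tokens = description.lower().split()  # naive whitespace tokenizer (assumption)
    matches: List[Tuple[str, ...]] = []
    for n in range(1, max_n + 1):
        matches.extend(g for g in ngrams(tokens, n) if g in skills)
    return matches

# Hypothetical skill list and job description, for illustration only.
SKILLS = {("hive",), ("machine", "learning"), ("natural", "language", "processing")}
JD = "Seeking engineer with Hive and machine learning or natural language processing experience"

found = match_skills(JD, SKILLS)
print(found)
```

Storing each skill as a tuple of words lets 1-, 2- and 3-word skills share one lookup set, which mirrors the 1w/2w/3w split in the numbers below.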
It took a few tries to get this going at scale in Hive. I was running into an issue where I'd get all the way to 100% mapped, and then get stuck at 70% reduced. A candidate I was interviewing let me know it was likely running out of disk space while writing the result set (the EMR logs weren't very helpful).
Being a lazy developer, I tried larger node footprints (and even partitions) but was still getting nowhere fast.
So I decided to split my Hive query into a number of steps. Two changes were the keys to scaling this: pulling the seemingly innocuous join to the whole jobs table for a simple count (30M rows) out into its own step, and creating a temp table with everything except that count before doing the tf-idf calc.
I am now processing the following every morning complete with a tf-idf score. I hope this is helpful!
jobs: 45,694,724
active jobs: 1,619,206
----------
1w active job/skills matches: 22,932,796 avg: 14.49
2w active job/skills matches: 8,806,166 avg: 6.62
3w active job/skills matches: 506,809 avg: 1.28
total job ngrams reviewed: 1,272,560,565
----------
total 1w ngrams reviewed: 297,779,164
total 2w ngrams reviewed: 475,109,902
total 3w ngrams reviewed: 499,671,499
# get the number of jobs
hive -e 'drop table job_count;'
hive -e 'create table job_count(cnt int);
INSERT OVERWRITE TABLE job_count
select count(distinct hash) as cnt FROM todays_jobs';
# coalesce by 1/2/3w n-gram matches
hive -e 'drop table all_job_skills_match;'
hive -e 'create table all_job_skills_match (hash string, company string, city string, title string, lay_skill string, ngram array<string>, estfrequency double);
INSERT OVERWRITE TABLE all_job_skills_match
SELECT * from (
SELECT * from (
select * from 1w_job_skills_match UNION ALL
select * from 2w_job_skills_match
) XXX
UNION ALL
select * from 3w_job_skills_match) YYY
;'
# count how many job descriptions each skill shows up in
hive -e 'drop table job_skills_count'
hive -e 'create table job_skills_count(ngram array<string>, count int);
INSERT OVERWRITE TABLE job_skills_count
select ngram, count(ajsm.hash) from all_job_skills_match ajsm group by ngram;'
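
The count above is the document-frequency half of tf-idf: for each skill, how many jobs mention it. A small Python sketch of the same idea follows, using made-up rows. Note one assumption: the sketch deduplicates (job, skill) pairs first, whereas the Hive query counts rows, so the two agree only if the match table holds one row per job and skill.

```python
from collections import Counter

# Hypothetical (job hash, matched skill n-gram) rows standing in for
# all_job_skills_match; the values are invented for illustration.
matches = [
    ("job1", ("hive",)),
    ("job1", ("machine", "learning")),
    ("job2", ("hive",)),
    ("job3", ("hive",)),
    ("job3", ("hive",)),  # the same skill matched twice in one job
]

# Deduplicate (job, skill) pairs so each job counts once per skill,
# then count how many jobs each skill appears in.
job_matches = Counter(ngram for _job, ngram in set(matches))

print(job_matches[("hive",)])
print(job_matches[("machine", "learning")])
```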
# create a table with all job stats, but NOT joined to the total number of jobs
hive -e 'drop table todays_job_stats';
hive -e 'create table todays_job_stats (hash string, company string, city string, title string, lay_skill string, ngram array<string>, estfrequency double, job_matches int);
INSERT OVERWRITE TABLE todays_job_stats
select jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count
FROM job_skills_count jsc
JOIN all_job_skills_match jsm
ON (jsc.ngram = jsm.ngram)
GROUP BY jsm.hash, jsm.company, jsm.city, jsm.title, jsm.lay_skill, jsm.ngram, jsm.estfrequency, jsc.count '
# finally run the tf-idf calc by joining to total number of jobs
hive -e 'select hash, company, city, title, lay_skill, ngram, estfrequency, round((estfrequency * log10(cnt/job_matches)),2) from todays_job_stats JOIN job_count' > output.txt
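
For anyone who wants to sanity-check the score, the final select computes estfrequency * log10(cnt / job_matches), rounded to two places. Here is the same arithmetic as a small Python sketch. The `tf_idf` helper and the input numbers are hypothetical, chosen only so the result is easy to verify by hand.

```python
import math

def tf_idf(estfrequency: float, total_jobs: int, job_matches: int) -> float:
    """Score one job/skill row: term frequency times log10 inverse document frequency."""
    return round(estfrequency * math.log10(total_jobs / job_matches), 2)

# Hypothetical inputs: a skill seen twice in a job, appearing in
# 1,000 of 1,000,000 jobs, so idf = log10(1000) = 3.
rare = tf_idf(2.0, 1_000_000, 1_000)

# A skill that appears in every job has idf = log10(1) = 0.
everywhere = tf_idf(2.0, 1_000_000, 1_000_000)

print(rare, everywhere)
```

A skill that shows up in nearly every job scores close to zero no matter how often it is repeated, which is why the total job count has to be joined in at this last step.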
Hi, great post!
Any idea where I can find an up-to-date list of skills?
Or how to detect them from a corpus of job descriptions?
Thanks!