HR professionals often go through the process of matching jobs across compensation surveys to determine the pay ranges for related job codes. However, this is usually a manual, lengthy process involving a lot of human judgement. One way that HR can use Python to save time in this process is by building a k-nearest neighbors model with scikit-learn to “fuzzy-match” jobs.
So let’s get started by importing all the relevant modules.
Code
```python
import os
import sys
import pandas as pd
import difflib
import regex as re
from ftfy import fix_text
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
import time
from great_tables import GT, exibble
from sklearn.utils import shuffle
import seaborn as sns
import warnings

pd.set_option('display.max_columns', None)  # show all columns when using head()
pd.options.display.float_format = '{:.2f}'.format
np.set_printoptions(suppress=True)  # suppress scientific notation
warnings.filterwarnings("ignore")
```
Next, we will load the CSV files from each survey, reading them with the default Microsoft encoding (code page 1252, i.e. `encoding='cp1252'`). Let’s take a peek at the first few rows of those files too.
| Job Code | Job Title | Description | Family |
|---|---|---|---|
| … | … | Job Summary: Designs, implements, and maintains wa… | Research |
| 2112_2 | Physicist IV | Job Summary: Researches, studies, and collects dat… | Finance |
| 13171 | Sales Analysis Director Senior | Job Summary: Leads and directs the sales analysis… | Research |
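The loading step itself isn’t shown above, so here is a minimal sketch of reading a cp1252-encoded survey export. The file name and sample row are made up, and a small demo file is written first so the snippet runs on its own; the real code would simply call `pd.read_csv(path, encoding='cp1252')` on each survey file.

```python
import pandas as pd

# Hypothetical survey export written in cp1252 so the read-back is demonstrable
sample = 'Job Code,Job Title,Description,Family\n2112_2,Physicist IV,Job Summary: Researches\u2026,Finance\n'
with open('survey_a_demo.csv', 'w', encoding='cp1252') as f:
    f.write(sample)

# Read with the Microsoft code page 1252 encoding
survey_a = pd.read_csv('survey_a_demo.csv', encoding='cp1252')
print(survey_a.head())
```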
Now let’s take another look at the number of jobs per job family for each survey.
Code
```python
g = sns.catplot(x="Family", data=survey_a, kind="count",
                order=survey_a.Family.value_counts().index, height=3, aspect=2)
g.set_axis_labels("", "Count of Jobs - Survey A")
survey_a.describe()
```
| | Job Code | Job Title | Description | Family |
|---|---|---|---|---|
| count | 358 | 358 | 358 | 358 |
| unique | 345 | 355 | 358 | 7 |
| top | 11175 | Mechanical Engineer II | Assists in maintaining financial records and e... | IT |
| freq | 3 | 2 | 1 | 147 |
Code
```python
g2 = sns.catplot(x="Family", data=survey_b, kind="count",
                 order=survey_b.Family.value_counts().index, height=3, aspect=2)
g2.set_axis_labels("", "Count of Jobs - Survey B")
survey_b.describe()
```
| | Job Code | Job Title | Description | Family |
|---|---|---|---|---|
| count | 367 | 367 | 367 | 367 |
| unique | 352 | 364 | 367 | 7 |
| top | 2111 | Cost Accountant IV | Provides administrative support to for one or ... | IT |
| freq | 3 | 2 | 1 | 150 |
OK, they both look very similar, so there’s no reason not to proceed. First we create a column to join the data on. Then comes the good part: we define the text preprocessing for the model, removing all special characters and normalising the case. We’ll break each string into overlapping three-character n-grams (`n=3`), which gives the matcher many small fragments to compare and so maximises the potential matches.
Code
```python
survey_a['match_column'] = survey_a['Job Title'] + '_' + survey_a['Description']  # columns to join on
survey_b['match_column'] = survey_b['Job Title'] + '_' + survey_b['Description']

# model
def ngrams(string, n=3):
    string = str(string)
    string = fix_text(string)  # fix text
    string = string.encode("ascii", errors="ignore").decode()  # remove non-ASCII chars
    string = string.lower()
    chars_to_remove = [")", "(", ".", "|", "[", "]", "{", "}", "'", ":", "-"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.replace('_', ' ')
    string = string.title()  # normalise case - capital at start of each word
    string = re.sub(' +', ' ', string).strip()  # collapse multiple spaces into one
    string = ' ' + string + ' '  # pad names for ngrams...
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

# convert to unicode
unique_a = survey_a['match_column'].unique().astype('U')
unique_b = survey_b['match_column'].unique().astype('U')
```
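To make the n-gram idea concrete, here is a stripped-down sketch of just the sliding-window step (the cleaning logic above is omitted, and `char_ngrams` is a hypothetical helper, not part of the original code):

```python
def char_ngrams(s, n=3):
    # slide a window of width n across the string, one character at a time
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("Analyst"))  # ['Ana', 'nal', 'aly', 'lys', 'yst']
```

Two titles like “Analyst I” and “Analyst II” share almost all of their trigrams, which is what makes this representation forgiving of small spelling and wording differences.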
This next step vectorizes the data using term frequency–inverse document frequency (TF-IDF). This is effectively the same idea behind the vector databases common to generative AI applications: text is converted into numerical vectors that can be compared by distance.
Code
```python
### Build vectorizer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)

### Nearest neighbors on matching column
tfidf = vectorizer.fit_transform(unique_a.astype('U'))
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

### Matching query, computing the distance from each survey B row to its nearest survey A row
def getNearestN(query):
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

### Getting nearest N
distances, indices = getNearestN(unique_b)

### convert back to a list
unique_b = list(unique_b)

### find matches
matches = []
for i, j in enumerate(indices):
    temp = [round(distances[i][0], 2), unique_a[j][0], unique_b[i]]
    matches.append(temp)

### build data frame
matches = pd.DataFrame(matches)
matches = matches.rename({0: "match_confidence", 1: "match_column_a", 2: "match_column_b"}, axis=1)

### bring survey a data back in
matches_a = survey_a.merge(matches, left_on='match_column', right_on='match_column_a',
                           how='left', suffixes=('', '_a'))

### merge back with survey b data
matches_all = pd.merge(matches_a, survey_b, left_on='match_column_b', right_on='match_column',
                       how='left', suffixes=('', '_b'))

### clean up
matches_all.columns = matches_all.columns.str.replace(' ', '_')  # replace spaces with underscores
matches_all.columns = map(str.lower, matches_all.columns)  # make all column headers lowercase
matches_all = matches_all.dropna(subset=['match_confidence'])  # drop NAs
best_match_score = matches_all

### min-max scale the distances onto [0, 1] and round: (x - min(x)) / (max(x) - min(x))
### note: these are still distances, so lower values mean closer matches
best_match_score.loc[:, 'match_confidence'] = round(
    (best_match_score.loc[:, 'match_confidence'] - min(best_match_score.loc[:, 'match_confidence']))
    / (max(best_match_score.loc[:, 'match_confidence']) - min(best_match_score.loc[:, 'match_confidence'])),
    3)

### save as csv and read back in
best_match_score.to_csv(obj_path + 'best_match_score.csv', index=False)
preview_c = pd.read_csv(obj_path + 'best_match_score.csv', encoding='cp1252')
```
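To see the pipeline in miniature, here is a self-contained toy version of the same TF-IDF plus nearest-neighbor matching. The job titles are invented for illustration, and a bare trigram splitter stands in for the full `ngrams()` cleaner above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

catalog = ["Data Analyst II", "Software Engineer III", "Payroll Specialist I"]
queries = ["Sr Data Analyst", "Software Engineer"]

# plain character trigrams, standing in for the full ngrams() cleaner
def trigrams(s):
    return [s[i:i + 3] for i in range(len(s) - 2)]

vec = TfidfVectorizer(min_df=1, analyzer=trigrams)
tfidf = vec.fit_transform(catalog)              # index the "catalog" survey
nbrs = NearestNeighbors(n_neighbors=1).fit(tfidf)
distances, indices = nbrs.kneighbors(vec.transform(queries))  # match the queries

for q, d, i in zip(queries, distances, indices):
    print(f"{q!r} -> {catalog[i[0]]!r} (distance {d[0]:.2f})")
```

Each query lands on the catalog entry it shares the most trigrams with, and the returned distance quantifies how close the match is.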
Hooray, the model is built. Let’s sort the results and pick out the final columns we need before we preview a random sample of the final data.
| job_code | job_title | description | job_code_b | job_title_b | description_b | match_confidence |
|---|---|---|---|---|---|---|
| … | … | Job Summary: Researches, studies, and collects data on matter, energy, and forces. … | 2118_2 | Physicist II | Job Summary: Researches, studies, and collects data on matter, energy, and forces. … | 0.224 |
| 3817 | Geographic Information Systems (GIS) Analyst III | Job Summary: Applies knowledge of Geographic Information Systems (GIS) to a variety of projects … | 3810 | Geographic Information Systems (GIS) Analyst II | Job Summary: Applies knowledge of Geographic Information Systems (GIS) to a variety of projects … | 0.071 |
| 11853 | Reliability Analyst (Aircraft) II | Job Summary: Collects, aggregates, and analyzes data measuring the performance and reliability of aircraft systems … | 11854 | Reliability Analyst (Aircraft) III | Job Summary: Collects, aggregates, and analyzes data measuring the performance and reliability of aircraft systems … | 0.092 |
| 3704 | Value Analyst III | Job Summary: Conducts financial analysis of purchasing or supply chain activities with the goal of reducing costs … | 6580_2 | Value Analyst I | Job Summary: Conducts financial analysis of purchasing or supply chain activities with the goal of reducing costs … | 0.204 |
| 202 | Mortgage Loan Officer I | Job Summary: Markets and FHA/VA or conventional residential mortgage loans. … | 203_2 | Mortgage Loan Officer II | Job Summary: Markets and FHA/VA or conventional residential mortgage loans. … | 0.082 |
| 3375 | IT Systems Application Analyst I | Job Summary: Analyzes and evaluates existing or proposed systems and devices to process data. … | 3370 | IT Systems Application Analyst II | Job Summary: Analyzes and evaluates existing or proposed systems and devices to process data. … | 0.184 |
| 3053 | Research and Development Executive | Job Summary: Responsible for all operations associated with research projects meant to develop new or improved products … | 11731 | Research and Development Director | Job Summary: Directs and oversees a portfolio of research projects that are meant to develop new or improved products … | 0.224 |
| 11101 | Sales Analysis Director | Job Summary: Directs the sales analysis function of an organization. … | 73_4 | Sales Analyst | Manages and administers sales related data processes for the organization’s product/services … | 0.786 |
| 1410 | IT Client/Server Database Analyst V | Job Summary: Evaluates and designs existing or proposed systems to access large-scale client/server databases. … | 1418_2 | IT Client/Server Database Analyst IV | Job Summary: Evaluates and designs existing or proposed systems to access large-scale client/server databases. … | 0.194 |
| 3817 | Geographic Information Systems (GIS) Analyst III | Job Summary: Applies knowledge of Geographic Information Systems (GIS) to a variety of projects … | 3818 | Geographic Information Systems (GIS) Analyst IV | Job Summary: Applies knowledge of Geographic Information Systems (GIS) to a variety of projects … | 0.082 |
Looks like several good matches. Note that because the scores are min-max scaled distances, lower match_confidence values indicate closer matches; the Sales Analysis Director row, at 0.786, is the weakest match in the sample. One more preview of the descriptive statistics, and then I think we’re done.
Code
preview.describe()
| | match_confidence |
|---|---|
| count | 367.00 |
| mean | 0.43 |
| std | 0.28 |
| min | 0.00 |
| 25% | 0.17 |
| 50% | 0.40 |
| 75% | 0.66 |
| max | 1.00 |
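The statistics above describe the scaled distances. As a quick illustration of the min-max transformation applied earlier (the raw numbers below are made up):

```python
def min_max_scale(xs):
    # (x - min(x)) / (max(x) - min(x)) maps the values onto [0, 1]
    lo, hi = min(xs), max(xs)
    return [round((x - lo) / (hi - lo), 3) for x in xs]

raw_distances = [0.10, 0.55, 1.00, 0.40]
print(min_max_scale(raw_distances))  # [0.0, 0.5, 1.0, 0.333]
```

The smallest raw distance always becomes 0 and the largest becomes 1, which is why the descriptive statistics show a min of 0.00 and a max of 1.00.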
Now, with this file, we can pull in pay ranges for similar jobs, along with a quantified measure of how closely the jobs in the two surveys match. Success.
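As a sketch of that final step, a hypothetical pay-range table keyed on the survey’s job codes could be merged onto the match file like this. All column names, codes, and dollar figures here are invented for illustration; the original post does not show this step.

```python
import pandas as pd

# Hypothetical pay-range data for survey B jobs
pay_ranges = pd.DataFrame({
    "job_code": ["3810", "3818", "11854"],
    "pay_min": [55000, 72000, 60000],
    "pay_max": [75000, 96000, 82000],
})

# Hypothetical slice of the match file
matches = pd.DataFrame({
    "job_code": ["3817", "3817"],
    "job_code_b": ["3810", "3818"],
    "match_confidence": [0.071, 0.082],
})

# Left join the pay ranges onto each matched survey B job code
with_pay = matches.merge(
    pay_ranges, left_on="job_code_b", right_on="job_code",
    how="left", suffixes=("", "_b_range"))
print(with_pay[["job_code", "job_code_b", "match_confidence", "pay_min", "pay_max"]])
```

From there, the match_confidence column lets you decide how much to trust each borrowed pay range.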
Thanks for taking a look at this sample of how to perform job-matching in Python. Hope you find it useful.
Thanks to Josh Taylor for providing the basics of this method.