Usage Examples¶
Corpus Creator with Sampling and Filtering¶
To showcase the corpus creator and its options, we generate a few different job postings corpora:
- a simple one from a single quarter's worth of data
- filtered on different fields like SOC code and base salary
- reservoir-sampled versions of each of the above
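The filtering and reservoir-sampling steps listed above can be sketched in plain Python. This is a minimal illustration of the technique, not the library's corpus creator: the field names `onet_soc_code` and `base_salary`, the helper names, and the toy postings are all assumptions made for the example.

```python
import random
from typing import Callable, Iterable, Iterator, List, Optional


def filtered(postings: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Lazily yield only the postings that satisfy the predicate."""
    return (posting for posting in postings if predicate(posting))


def reservoir_sample(items: Iterable[dict], k: int, seed: Optional[int] = None) -> List[dict]:
    """Uniformly sample up to k items from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: List[dict] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical postings; real job posting records carry many more fields.
postings = [
    {
        "id": n,
        "onet_soc_code": "15-1132.00" if n % 2 else "29-1141.00",
        "base_salary": 40000 + 1000 * n,
    }
    for n in range(100)
]

# Filter on SOC major group and base salary, then sample five postings.
software_postings = filtered(
    postings,
    lambda p: p["onet_soc_code"].startswith("15-") and p["base_salary"] >= 50000,
)
sample = reservoir_sample(software_postings, k=5, seed=0)
```

Because the sampler consumes a generator, the filtered corpus never has to be held in memory; only the `k` sampled postings are.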
Extracting Skills using Noun Phrase Endings¶
To showcase the noun phrase skill extractor, we download open job postings from Virginia Tech's open data portal and run them through the skill extractor. In the end, we have the most commonly occurring noun phrases ending in 'skill' or 'skills'.
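As a rough illustration of the idea, the sketch below counts phrases that end in 'skill' or 'skills' in a few made-up posting descriptions. It is a crude approximation: it takes up to two preceding words instead of doing real noun phrase chunking, and the stop word list, function name, and sample text are assumptions rather than part of the library's extractor.

```python
import re
from collections import Counter

# Small, hypothetical stop list; words here never become part of a phrase.
STOP_WORDS = {"and", "or", "the", "a", "an", "of", "with", "strong", "excellent", "good"}


def phrases_ending_in_skill(text, max_modifiers=2):
    """Return phrases of the form '<up to max_modifiers words> skill(s)'."""
    phrases = []
    # Split on punctuation so a phrase never crosses a clause boundary.
    for chunk in re.split(r"[.;,:!?\n]", text.lower()):
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", chunk)
        for i, token in enumerate(tokens):
            if token not in ("skill", "skills"):
                continue
            start = i
            # Walk backwards over modifiers until a stop word or the limit.
            while (
                start > 0
                and i - start < max_modifiers
                and tokens[start - 1] not in STOP_WORDS
            ):
                start -= 1
            if start < i:
                phrases.append(" ".join(tokens[start : i + 1]))
    return phrases


# Made-up descriptions standing in for downloaded job postings.
descriptions = [
    "Strong communication skills and time management skills required.",
    "Excellent communication skills; basic computer skill.",
]

counts = Counter()
for description in descriptions:
    counts.update(phrases_ending_in_skill(description))

most_common = counts.most_common(1)
```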
Skill Extraction and Evaluation Loop¶
To showcase how skill extraction algorithms can be tested, we run extraction several times with different parameters:
- Skill extraction algorithms (exact, fuzzy matching)
- Base ontologies (ONET subsetted to Abilities, Skills, and Knowledge)
- Metrics (Total Vocabulary Size, Total Candidate Skills, Recall of Given Ontology)
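The loop over extraction algorithms and metrics might look roughly like the sketch below. It is an assumption-laden stand-in: the tiny ontology, the n-gram candidate generation, the `difflib`-based fuzzy matcher, and the metric definitions are chosen for illustration and are not the library's implementations.

```python
import difflib
import re

# Hypothetical miniature ontology standing in for an ONET subset.
ONTOLOGY = ["active listening", "critical thinking", "programming", "mathematics"]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, max_n=2):
    """Yield all unigrams and bigrams as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i : i + n])


def exact_match(text, ontology):
    candidates = set(ngrams(tokenize(text)))
    return {skill for skill in ontology if skill in candidates}


def fuzzy_match(text, ontology, cutoff=0.85):
    candidates = list(set(ngrams(tokenize(text))))
    # A skill counts as found if any candidate is similar enough to it.
    return {
        skill
        for skill in ontology
        if difflib.get_close_matches(skill, candidates, n=1, cutoff=cutoff)
    }


def evaluate(extractor, documents, ontology):
    vocabulary, found, total_candidates = set(), set(), 0
    for document in documents:
        vocabulary.update(tokenize(document))
        skills = extractor(document, ontology)
        total_candidates += len(skills)
        found |= skills
    return {
        "total_vocabulary_size": len(vocabulary),
        "total_candidate_skills": total_candidates,
        "ontology_recall": len(found) / len(ontology),
    }


# Made-up documents; note the misspelling "programing" in the first one.
documents = [
    "Requires critical thinking and programing experience.",
    "Active listening and mathematics background preferred.",
]

results = {
    name: evaluate(extractor, documents, ONTOLOGY)
    for name, extractor in [("exact", exact_match), ("fuzzy", fuzzy_match)]
}
```

On these documents the exact matcher misses the misspelled skill, while the fuzzy matcher recovers it, which is the kind of difference the evaluation loop is meant to surface.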
Train a Word2Vec Embedding Model using Quarterly Job Posting Data¶
To showcase the interface for training a word2vec embedding model in an online batch learning fashion, we use:
- A list of quarters for creating the corpus from job posting data
- A trainer object that specifies parameters such as the source, S3 path, batch size, and model type
- The train method, which takes whatever arguments gensim.models.word2vec.Word2Vec or gensim.models.doc2vec.Doc2Vec accepts
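The shape of this interface can be sketched without gensim. In the toy version below, everything is hypothetical: `ToyTrainer`, `VocabularyModel`, and the in-memory corpus are stand-ins invented for the example. The stand-in model only counts tokens, but it shows the two ideas named above: documents are fed to the model one batch at a time, and `train` forwards its keyword arguments to the model constructor.

```python
from collections import Counter
from itertools import islice


class VocabularyModel:
    """Stand-in for an embedding model; it only keeps token counts."""

    def __init__(self, min_count=1):
        self.min_count = min_count
        self.counts = Counter()
        self.batches_seen = 0

    def update(self, batch):
        # Incrementally absorb one batch of tokenized documents.
        for document in batch:
            self.counts.update(document)
        self.batches_seen += 1

    @property
    def vocabulary(self):
        return {word for word, count in self.counts.items() if count >= self.min_count}


def batches(documents, batch_size):
    """Yield successive lists of at most batch_size documents."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


class ToyTrainer:
    def __init__(self, corpus_by_quarter, quarters, batch_size, model_class):
        self.corpus_by_quarter = corpus_by_quarter
        self.quarters = quarters
        self.batch_size = batch_size
        self.model_class = model_class

    def train(self, **model_kwargs):
        # Keyword arguments are passed straight through to the model.
        model = self.model_class(**model_kwargs)
        for quarter in self.quarters:
            for batch in batches(self.corpus_by_quarter[quarter], self.batch_size):
                model.update(batch)
        return model


# Made-up tokenized postings keyed by quarter.
corpus_by_quarter = {
    "2011Q1": [["registered", "nurse"], ["software", "engineer"], ["staff", "nurse"]],
    "2011Q2": [["data", "engineer"], ["nurse", "practitioner"]],
}

trainer = ToyTrainer(
    corpus_by_quarter, ["2011Q1", "2011Q2"], batch_size=2, model_class=VocabularyModel
)
model = trainer.train(min_count=2)
```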
Compute and Aggregate Properties of Job Postings as a Tabular Dataset¶
To show job posting property computation and aggregation, we calculate job posting counts by cleaned title, and upload the resulting CSV to S3.
This is essentially a mini version of the Data@Work Research Hub.
To enable this example to be run with as few dependencies as possible, we use:
- a fake local s3 instance
- a sample of the Virginia Tech open job postings dataset
- only title cleaning and job counting.
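A minimal sketch of the title cleaning and counting step follows. The cleaning rule (lowercase, keep letters only, collapse whitespace), the column names, and the sample titles are assumptions for illustration; the CSV is written to an in-memory buffer, and the S3 upload is left out.

```python
import csv
import io
import re
from collections import Counter


def clean_title(title):
    """Lowercase, replace anything that is not a letter with a space, collapse spaces."""
    letters_only = re.sub(r"[^a-z\s]", " ", title.lower())
    return " ".join(letters_only.split())


# Made-up postings standing in for a sample of the open dataset.
postings = [
    {"title": "Registered Nurse (RN)"},
    {"title": "registered nurse - RN"},
    {"title": "Software Engineer II"},
    {"title": "Software Engineer, II"},
    {"title": "Cashier #3"},
]

# Aggregate: number of postings per cleaned title.
counts = Counter(clean_title(posting["title"]) for posting in postings)

# Sort by descending count, then alphabetically, so the output is deterministic.
rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["cleaned_title", "count"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```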
Train an Occupation Classifier with Sample Jobposting Data¶
To showcase how an occupation classifier can be trained using grid search and cross-validation, we use:
- A sample of the Virginia Tech open job postings dataset
- An embedding model that is trained already
- Pipeline objects that take in all steps, including filters, transformation, tokenization, and vectorization
- A config dictionary for grid search
- A matrix object that specifies the data source, target variable, and pipelines
- An occupation classifier trainer object that specifies the input matrix, number of folds, grid search config, storage, and number of workers
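To make the grid search and cross-validation mechanics concrete, here is a small self-contained sketch. It does not use the library's trainer or matrix objects: the nearest-neighbour classifier, the two-dimensional "embeddings", the SOC major group labels, and the grid keys `k` and `metric` are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product


def expand_grid(grid):
    """Yield one parameter dict per combination in the grid config."""
    keys = sorted(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs using interleaved folds."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]
        held_out = set(test)
        yield [i for i in indices if i not in held_out], test


def distance(a, b, metric):
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_X, train_y, x, k, metric):
    # Majority vote among the k nearest training points.
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], x, metric))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cross_val_accuracy(X, y, params, n_folds):
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], **params) == y[i] for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical embedded postings and their SOC major groups.
X = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.9, 1.0), (1.0, 0.9), (0.8, 1.0)]
y = ["15", "15", "15", "29", "29", "29"]

grid_config = {"k": [1, 3], "metric": ["euclidean", "manhattan"]}
results = [
    (params, cross_val_accuracy(X, y, params, n_folds=3))
    for params in expand_grid(grid_config)
]
best_params, best_score = max(results, key=lambda result: result[1])
```

Each parameter combination is scored on every fold, so the number of model fits is the grid size times the number of folds (here 4 × 3), which is why a trainer of this kind typically exposes a number-of-workers setting.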