[TOC]
skills_ml.algorithms.embedding
skills_ml.algorithms.embedding.base
skills_ml.algorithms.embedding.models
Embedding model classes that inherit their interface from gensim
Word2VecModel
Word2VecModel(self, model_name=None, storage=None, *args, **kwargs)
The Word2VecModel inherits from gensim's Word2Vec model (https://radimrehurek.com/gensim/models/word2vec.html) and adds extension methods for training, using, and evaluating word embeddings.
Example
from skills_ml.algorithms.embedding.models import Word2VecModel
word2vec_model = Word2VecModel()
Doc2VecModel
Doc2VecModel(self, model_name=None, storage=None, *args, **kwargs)
The Doc2VecModel inherits from gensim's Doc2Vec model (https://radimrehurek.com/gensim/models/doc2vec) and adds extension methods for training, using, and evaluating document embeddings.
Example
from skills_ml.algorithms.embedding.models import Doc2VecModel
doc2vec_model = Doc2VecModel()
FastTextModel
FastTextModel(self, model_name=None, storage=None, *args, **kwargs)
The FastTextModel inherits from gensim's FastText model (https://radimrehurek.com/gensim/models/fasttext.html) and adds extension methods for training, using, and evaluating word embeddings.
Example
```
from skills_ml.algorithms.embedding.models import FastTextModel
fasttext = FastTextModel()
```
skills_ml.algorithms.embedding.train
EmbeddingTrainer
EmbeddingTrainer(self, *models, model_storage=None, batch_size=2000)
An embedding learning class.
Example
from skills_ml.algorithms.embedding.models import Word2VecModel
from skills_ml.algorithms.embedding.train import EmbeddingTrainer
from skills_ml.job_postings.common_schema import JobPostingCollectionSample
from skills_ml.job_postings.corpora.basic import Doc2VecGensimCorpusCreator, Word2VecGensimCorpusCreator
from skills_ml.storage import FSStore
# Any iterable of job postings works here; to read postings from S3 instead:
# s3_conn = S3Hook().get_conn()
# job_postings_generator = JobPostingGenerator(s3_conn, quarters, s3_path, source="all")
job_postings_generator = JobPostingCollectionSample()
corpus_generator = Word2VecGensimCorpusCreator(job_postings_generator)
w2v = Word2VecModel(storage=FSStore(path='/tmp'), size=10, min_count=3, iter=4, window=6, workers=3)
trainer = EmbeddingTrainer(w2v)
trainer.train(corpus_generator)
trainer.save_model()
skills_ml.algorithms.geocoders
Geocoders, with caching and throttling
CachedGeocoder
CachedGeocoder(self, cache_storage, cache_fname, geocode_func=<function osm>, sleep_time=1, autosave=True)
Geocoder that uses specified storage as a cache.
Args
- cache_storage (object): FSStore() or S3Store object to store the cache
- cache_fname (string): cache file name
- geocode_func (function): a function that geocodes a given search string; defaults to the OSM geocoder provided by the geocoder library
- sleep_time (int): the time, in seconds, between geocode calls
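The caching and throttling behaviour described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `CachedGeocoderSketch`, `fake_geocode`, and the injectable `sleep` argument are hypothetical names, and a plain dict stands in for the storage-backed cache file.

```python
import time


class CachedGeocoderSketch:
    """Illustrative only: caches the results of a geocode function in a dict."""

    def __init__(self, geocode_func, sleep_time=1, sleep=time.sleep):
        self.geocode_func = geocode_func  # callable: search string -> result
        self.sleep_time = sleep_time      # seconds to wait after each real call
        self.sleep = sleep                # injectable so the wait can be skipped
        self.cache = {}                   # stands in for the stored cache file

    def geocode(self, search_string):
        # Repeated searches are served from the cache without a service call.
        if search_string not in self.cache:
            self.cache[search_string] = self.geocode_func(search_string)
            self.sleep(self.sleep_time)   # throttle only after a real call
        return self.cache[search_string]


calls = []


def fake_geocode(search_string):
    # Hypothetical stand-in for a real geocoding service.
    calls.append(search_string)
    return {"address": search_string.upper()}


geocoder_sketch = CachedGeocoderSketch(fake_geocode, sleep_time=0, sleep=lambda s: None)
first = geocoder_sketch.geocode("Flushing, NY")
second = geocoder_sketch.geocode("Flushing, NY")  # cache hit, no second call
```

Throttling only after cache misses keeps repeated lookups fast while still spacing out requests to the remote service.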
skills_ml.algorithms.geocoders.cbsa
Given geocode results, find matching Core-Based Statistical Areas.
Match
Match(self, /, *args, **kwargs)
Match(index, area)
CachedCBSAFinder
CachedCBSAFinder(self, cache_storage, cache_fname, shapefile_name=None, cache_dir=None)
Find CBSAs associated with geocode results and save them to the specified storage
Geocode results are expected in the JSON format provided by the Python geocoder module, with a 'bbox' key.
The highest-level interface is the 'find_all_cbsas_and_save' method, which provides storage caching. A minimal call looks like
import geocoder
from skills_ml.storage import S3Store
from skills_ml.algorithms.geocoders.cbsa import CachedCBSAFinder
cache_storage = S3Store('some-bucket')
cache_fname = 'cbsas.json'
cbsa_finder = CachedCBSAFinder(cache_storage=cache_storage, cache_fname=cache_fname)
cbsa_finder.find_all_cbsas_and_save({
    "Flushing, NY": {'bbox': {'southwest': [..., ...], 'northeast': [..., ...]}},
    "Houston, TX": {'bbox': {'southwest': [..., ...], 'northeast': [..., ...]}},
})
# This usage of 'bbox' is what you can retrieve from a `geocoder` call, such as:
geocoder.osm('Flushing, NY').json()
The keys in the resulting cache will be the original search strings.
Warning: The caching is not parallel-safe! Run only one copy of find_all_cbsas_and_save at a time to avoid overwriting the cache file.
Args
- cache_storage (object): FSStore() or S3Store object to store the cache
- cache_fname (string): cache file name
- shapefile_name (string): local path to a CBSA shapefile to use; optional, will download the TIGER 2015 shapefile if absent
- cache_dir (string): local path to a cache directory to use if the shapefile needs to be downloaded; optional, will use 'tmp' in the working directory if absent
skills_ml.algorithms.job_normalizers
Algorithms to normalize a job title to a smaller space
skills_ml.algorithms.job_normalizers.elasticsearch
Indexes job postings for job title normalization
NormalizeTopNIndexer
NormalizeTopNIndexer(self, quarter, job_postings_generator, job_titles_index, alias_name, **kwargs)
Creates an index that stores data for job title normalization.
Depends on a previously created index with job titles and occupations.
Queries the job title/occupation index for (1) job titles or occupations that match the job description and (2) occupation matches.
The top three results are indexed.
Args
- quarter (string): the quarter from which to retrieve job postings
- job_postings_generator (iterable): an iterable of job postings
- job_titles_index (string): the name of an already existing job title/occupation index
skills_ml.algorithms.job_normalizers.esa_jobtitle_normalizer
Normalize a job title through Explicit Semantic Analysis
Originally written by Kwame Porter Robinson
ESANormalizer
ESANormalizer(self, onet_source=<class 'skills_ml.datasets.onet_source.OnetToDiskDownloader'>)
Normalize a job title to ONET occupation titles using explicit semantic analysis.
Uses ONET occupation titles and descriptions.
skills_ml.algorithms.jobtitle_cleaner
Clean job titles
skills_ml.algorithms.jobtitle_cleaner.clean
Clean job titles by utilizing a list of stopwords
clean_by_rules
clean_by_rules(jobtitle)
Remove numbers and normalize spaces
Args
- jobtitle (string): A string
- Returns: (string) the string with numbers removed and spaces normalized
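A rough sketch of this kind of rule-based cleaning, using only the standard `re` module. The name `clean_by_rules_sketch` is hypothetical, and replacing digits with a space and trimming the ends are assumptions; the library's exact rules may differ.

```python
import re


def clean_by_rules_sketch(jobtitle):
    # Replace every run of digits with a space (assumed behaviour).
    without_numbers = re.sub(r"\d+", " ", jobtitle)
    # Collapse whitespace runs into single spaces and trim the ends.
    return re.sub(r"\s+", " ", without_numbers).strip()


cleaned_title = clean_by_rules_sketch("Nurse 2  RN  ")
```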
clean_by_neg_dic
clean_by_neg_dic(jobtitle, negative_list, positive_list)
Remove words from the negative dictionary
Args
- jobtitle (string): A job title string
- negative_list (collection): A list of stop words
- positive_list (collection): A list of positive words to override stop words
- Returns: (string) The cleaned job title
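The interplay between the negative and positive lists might look like the following. This is a sketch under assumptions: `clean_by_neg_dic_sketch` is a hypothetical name, words are split on whitespace, and matching is case-sensitive, none of which is confirmed by the docstring.

```python
def clean_by_neg_dic_sketch(jobtitle, negative_list, positive_list):
    kept = []
    for word in jobtitle.split():
        # A word survives if it is explicitly allowed or is not a stop word.
        if word in positive_list or word not in negative_list:
            kept.append(word)
    return " ".join(kept)


cleaned = clean_by_neg_dic_sketch(
    "nurse houston texas",
    negative_list={"houston", "texas"},
    positive_list={"texas"},
)
```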
aggregate
aggregate(df_jobtitles, groupby_keys)
Args
- df_jobtitles: job titles in pandas DataFrame
- groupby_keys: a list of keys to be grouped by. should be something like ['title', 'geo']
Returns
agg_cleaned_jobtitles: an aggregated version of job titles in pandas DataFrame
JobTitleStringClean
JobTitleStringClean(self)
Clean job titles by stripping numbers, and removing place/state names (unless they are also ONET jobs)
skills_ml.algorithms.nlp
String transformations for cleaning.
For unicodedata, see http://www.unicode.org/reports/tr44/tr44-4.html#General_Category_Values
deep
deep(func)
A decorator that will apply a function to a nested list recursively
Args
- func (function): a function to be applied to a nested list
Returns
function: The wrapped function
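One way such a decorator can work is to recurse whenever the argument is a list and call the wrapped function on everything else. The sketch below is an assumption about the approach, with hypothetical names (`deep_sketch`, `shout`), not the library's code.

```python
from functools import wraps


def deep_sketch(func):
    @wraps(func)
    def wrapper(item, *args, **kwargs):
        # Recurse into lists; apply func to the leaves.
        if isinstance(item, list):
            return [wrapper(element, *args, **kwargs) for element in item]
        return func(item, *args, **kwargs)
    return wrapper


@deep_sketch
def shout(text):
    return text.upper()


nested_result = shout(["a", ["b", ["c"]]])
```

The nesting of the input is preserved in the output, and a non-list argument is passed straight through to the wrapped function.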
normalize
normalize(text:str) -> str
Args
- text (str): A unicode string
Returns
str: The text, lowercased and in NFKD normal form
lowercase_strip_punc
lowercase_strip_punc(text:str, punct:Set[str]=None) -> str
Args
- text (str): A unicode string
- punct (:obj:set, optional)
Returns
str: The text, lowercased, sans punctuation and in NFKD normal form
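The behaviour documented for normalize and lowercase_strip_punc can be approximated with the standard `unicodedata` module. This is a sketch, not the library's code: the `_sketch` names are hypothetical, and treating every character in a Unicode "P*" general category as punctuation when `punct` is omitted is an assumption about the default.

```python
import unicodedata


def normalize_sketch(text):
    # Lowercase, then decompose into NFKD normal form.
    return unicodedata.normalize("NFKD", text.lower())


def lowercase_strip_punc_sketch(text, punct=None):
    normalized = normalize_sketch(text)
    if punct is None:
        # Assumed default: drop characters whose general category starts with "P".
        return "".join(
            ch for ch in normalized
            if not unicodedata.category(ch).startswith("P")
        )
    # Otherwise drop only the characters listed in punct.
    return "".join(ch for ch in normalized if ch not in punct)


decomposed = normalize_sketch("Caf\u00e9")  # "e" followed by a combining accent
stripped = lowercase_strip_punc_sketch("Hello, World!")
```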
title_phase_one
title_phase_one(text:str, punct:Set[str]=None) -> str
Args
- text (str): A unicode string
- punct (:obj:set, optional)
Returns
str: The text, lowercased, sans punctuation, whitespace normalized
clean_str
clean_str(text:str) -> str
Args
- text: A unicode string
Returns
str: The text, lowercased, with punctuation and non-English letters removed
sentence_tokenize
sentence_tokenize(text:str) -> List[str]
Args
- text (str): a unicode string
Returns
list: tokenized sentences
word_tokenize
word_tokenize(text:str, punctuation=True) -> List[str]
Args
- text (str): a unicode string
Returns
list: tokenized words
fields_join
fields_join(document:Dict, document_schema_fields:List[str]=None) -> str
Args
- document (dict): a document dictionary
- document_schema_fields (:obj:list, optional): a list of keys
Returns
str: a text joined with selected fields.
vectorize
vectorize(tokenized_text:List[str], embedding_model)
Args
- tokenized_text: a tokenized list of word tokens
- embedding_model: an embedding model that implements the .infer_vector() method
Returns
np.ndarray: a word embedding vector
section_extract
section_extract(section_regex:Pattern[~AnyStr], document:str) -> List
Only return the contents of the configured section heading
Defines a 'heading' as the text of a sentence that
- does not itself start with a bullet character
- either has between 1 and 3 words or ends in a colon
For a heading that matches the given pattern, returns each sentence between it and the next heading.
Heavily relies on the fact that sentence_tokenize does line splitting as well as standard sentence tokenization. In this way, it should work both for text strings that have newlines and for text strings that don't.
In addition, this function splits each sentence by bullet characters as often bullets denote what we want to call 'sentences', but authors often take advantage of the bullet characters to make the contents of each 'sentence' into small sentence fragments, which makes standard sentence tokenization insufficient if the newlines have been taken out.
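The heading rule and section extraction described above can be sketched as follows. This is a simplified illustration under stated assumptions: it splits on newlines only instead of using sentence_tokenize, it strips leading bullets rather than splitting on bullets inside a sentence, and the bullet set and all names (`BULLETS`, `is_heading`, `section_extract_sketch`) are hypothetical.

```python
import re

BULLETS = ("*", "-", "+", "\u2022")  # assumed set of bullet characters


def is_heading(sentence):
    stripped = sentence.strip()
    # A heading must not start with a bullet character.
    if not stripped or stripped.startswith(BULLETS):
        return False
    # It either has between 1 and 3 words or ends in a colon.
    return 1 <= len(stripped.split()) <= 3 or stripped.endswith(":")


def section_extract_sketch(section_regex, document):
    units = [line.strip() for line in document.splitlines() if line.strip()]
    extracted, in_section = [], False
    for unit in units:
        if is_heading(unit):
            # Every heading opens a matching section or closes the current one.
            in_section = bool(section_regex.match(unit))
            continue
        if in_section:
            extracted.append(unit.lstrip("".join(BULLETS) + " "))
    return extracted


posting = (
    "About us\n"
    "We build things for customers worldwide.\n"
    "Skills:\n"
    "* Python programming\n"
    "* Clear writing\n"
    "Benefits\n"
    "Free lunch every single day."
)
skills_section = section_extract_sketch(re.compile(r"skills", re.IGNORECASE), posting)
```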
split_by_bullets
split_by_bullets(sentence:str) -> List
Split sentence by bullet characters
strip_bullets_from_line
strip_bullets_from_line(line:str) -> str
Remove bullets from beginning of line
skills_ml.algorithms.occupation_classifiers
SOCMajorGroup
SOCMajorGroup(self, filters=None)
FullSOC
FullSOC(self, filters=None, onet_cache=None)
skills_ml.algorithms.occupation_classifiers.classifiers
SocClassifier
SocClassifier(self, classifier)
Interface of SOC Code Classifier for computer class to use.
KNNDoc2VecClassifier
KNNDoc2VecClassifier(self, embedding_model, k=1, indexer=None, model_name=None, model_storage=None, **kwargs)
Nearest neighbors model to classify job posting data into SOC codes. If an indexer is passed, NearestNeighbors uses an approximate nearest neighbor approach, which is much faster than the built-in kNN in gensim.
Attributes
embedding_model (:obj:skills_ml.algorithms.embedding.models.Doc2VecModel): Doc2Vec embedding model
k (int): number of nearest neighbors. If k = 1, look for the SOC code from the single nearest neighbor. If k > 1, classify the SOC code by majority vote of the k nearest neighbors.
indexer (:obj:gensim.similarities.index): any kind of gensim compatible indexer
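The role of k can be illustrated with a small pure-Python sketch. It is not the classifier's implementation: it ranks by exact cosine similarity instead of a gensim model or indexer, and the function names, vectors, and SOC labels are made-up illustration data.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def knn_soc_sketch(query_vector, labelled_vectors, k=1):
    # Rank the labelled vectors from most to least similar to the query.
    ranked = sorted(
        labelled_vectors,
        key=lambda pair: cosine(query_vector, pair[0]),
        reverse=True,
    )
    # k = 1 returns the nearest label; k > 1 takes a majority vote.
    votes = Counter(soc for _, soc in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    ([1.0, 0.0], "15-1132"),
    ([0.9, 0.2], "15-1132"),
    ([0.6, 0.8], "29-1141"),
]
nearest = knn_soc_sketch([0.7, 0.7], training, k=1)
voted = knn_soc_sketch([0.7, 0.7], training, k=3)
```

With k = 1 the single closest vector decides the label; with k = 3 the two slightly more distant neighbours outvote it.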
skills_ml.algorithms.occupation_classifiers.test
skills_ml.algorithms.occupation_classifiers.train
OccupationClassifierTrainer
OccupationClassifierTrainer(self, matrix, k_folds, grid_config=None, storage=None, random_state_for_split=None, scoring=['accuracy'], n_jobs=3)
Trains a series of classifiers using the same training set
Args
- matrix (skills_ml.algorithms.train.matrix): a matrix object that holds X, y and other training data information
- storage (skills_ml.storage): a skills_ml storage object that specifies the storage method
- k_folds (int): number of folds for cross validation
- random_state_for_split (int): random state
- n_jobs (int): number of jobs to run in parallel
skills_ml.algorithms.preprocessing
ProcessingPipeline
ProcessingPipeline(self, *functions:Callable)
A simple callable processing pipeline for imperative execution runtime.
This class will compose processing functions together to become a callable object that takes in the input from the very first processing function and returns the output of the last processing function.
Example
This class can be used to create a callable vectorization object which
will transform a string into a vector and also preserve the preprocessing
functions for being reused later.
```python
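from functools import partial

jp = JobPostingCollectionSample()
# w2v is assumed to be a trained embedding model, e.g. a Word2VecModel
vectorization = ProcessingPipeline(
    normalize,
    clean_html,
    clean_str,
    word_tokenize,
    partial(vectorize, embedding_model=w2v)
)
vector = vectorization("Why so serious?")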
```
Attributes
functions (generator): a series of functions
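The composition idea behind this class can be sketched with `functools.reduce`. The class below is a hypothetical stand-in (`ProcessingPipelineSketch`), shown with built-in string functions so that it runs without the library.

```python
from functools import reduce


class ProcessingPipelineSketch:
    """Illustrative only: chains functions so each output feeds the next."""

    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, value):
        # Feed the value through every function, first to last.
        return reduce(lambda acc, func: func(acc), self.functions, value)


pipeline = ProcessingPipelineSketch(str.lower, str.split, len)
token_count = pipeline("Why so serious?")
```

Because the pipeline object keeps its functions, the same preprocessing can be reapplied later to new input.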
IterablePipeline
IterablePipeline(self, *functions:Callable)
A simple iterable processing pipeline.
This class composes processing functions so that they can be passed to different stages (training/prediction) to ensure the same processing procedures.
Example
jp = JobPostingCollectionSample()
pipe = IterablePipeline(
partial(fields_join, document_schema_fields=['description']),
clean_html,
sentence_tokenize,
clean_str,
word_tokenize
)
preprocessed_generator = pipe(jp)
Attributes
functions (generator): a series of generator functions that take another generator as input
func2gen
func2gen(func:Callable) -> Callable
A wrapper that changes a document-transforming function, which takes a single document as input, into a function that takes a generator/iterator as input. When called, the wrapped function returns a generator.
Example
@func2gen
def do_something(doc):
    return do_something_to_the_doc(doc)
Args
- func (function): a function that takes a single document as its first argument.
Returns
func (function): a function that takes a generator as its first argument.
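A sketch of how such a wrapper and an iterable pipeline can fit together. The names are hypothetical, and wrapping every function with the func2gen-style helper inside the pipeline is an assumption about the design, not a description of the library's internals.

```python
from functools import wraps


def func2gen_sketch(func):
    @wraps(func)
    def generator_func(documents, *args, **kwargs):
        # Apply the single-document function lazily to each document.
        for doc in documents:
            yield func(doc, *args, **kwargs)
    return generator_func


class IterablePipelineSketch:
    def __init__(self, *functions):
        self.generator_functions = [func2gen_sketch(f) for f in functions]

    def __call__(self, documents):
        stream = documents
        # Chain the generators so documents flow through one at a time.
        for generator_func in self.generator_functions:
            stream = generator_func(stream)
        return stream


pipe = IterablePipelineSketch(str.strip, str.lower, str.split)
processed = list(pipe(iter(["  Data Analyst ", "Registered NURSE"])))
```

Nothing is computed until the resulting generator is consumed, so large corpora can be streamed through the pipeline.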
skills_ml.algorithms.sampling
Generate and store samples of datasets
skills_ml.algorithms.sampling.methods
Generic sampling methods
reservoir
reservoir(it, k)
Reservoir sampling with random sort from a job posting iterator
Randomly chooses a sample of k items from a streaming iterator, using random sort to implement the algorithm. Each item is assigned a random number as its key, and the k items with the smallest keys are maintained. This is equivalent to assigning a random key to every item, sorting the items by key, and taking the top k.
Args
- it (iterator): Job posting iterator to sample from
- k (int): Sample size
Returns
generator: The result sample of k items.
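The random-sort idea can be sketched with a bounded heap that keeps the k items with the smallest random keys. This is an illustration under assumptions, not the library's code: `reservoir_sketch` and its `rng` parameter are hypothetical, and the position index is stored only so that items never need to be comparable.

```python
import heapq
import random


def reservoir_sketch(it, k, rng=random):
    # Heap of (-key, index, item): the root holds the largest key kept so far.
    heap = []
    for index, item in enumerate(it):
        key = rng.random()
        entry = (-key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key < -heap[0][0]:
            # The new key beats the largest kept key, so swap it in.
            heapq.heapreplace(heap, entry)
    for _, _, item in heap:
        yield item


sample = list(reservoir_sketch(range(100), 5, random.Random(0)))
```

Only k entries are held in memory at any time, which is what makes the approach suitable for a streaming iterator.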
reservoir_weighted
reservoir_weighted(it, k, weights, key)
Weighted reservoir sampling from a job posting iterator
Randomly chooses a sample of k items from a streaming iterator based on the weights.
Args
- it (iterator): Job posting iterator to sample from. The format should be (job_posting, label)
- k (int): Sample size
- weights (dict): a dictionary of label-weight pairs. Every label in the iterator is expected to be present as a key in the weights dictionary. For example, weights = {'11': 2, '13': 1}. In this case, the label/key is the occupation major group and the value is the weight you want to sample with.
Returns
generator: The result sample of k items from weighted reservoir sampling.
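One well-known way to do weighted reservoir sampling is the Efraimidis-Spirakis scheme, which gives each item the key u ** (1 / weight) for a uniform random u and keeps the k largest keys. The docstring does not say which algorithm the library uses, so the sketch below is an assumption; the function name and `rng` parameter are hypothetical, and the undocumented `key` argument is left out.

```python
import heapq
import random


def reservoir_weighted_sketch(it, k, weights, rng=random):
    # Min-heap of (key, index, item): the root holds the smallest key kept.
    heap = []
    for index, item in enumerate(it):
        _, label = item  # items are (job_posting, label) pairs
        key = rng.random() ** (1.0 / weights[label])
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # Larger weights push keys towards 1, so they are kept more often.
            heapq.heapreplace(heap, entry)
    for _, _, item in heap:
        yield item


postings = [("posting-%d" % i, "11" if i % 2 == 0 else "13") for i in range(50)]
weighted_sample = list(
    reservoir_weighted_sketch(postings, 10, {"11": 2, "13": 1}, random.Random(1))
)
```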
skills_ml.algorithms.skill_extractors
Extract skills from text corpora, such as job postings
skills_ml.algorithms.skill_extractors.base
Base classes for skill extraction
CandidateSkill
CandidateSkill(self, /, *args, **kwargs)
CandidateSkill(skill_name, matched_skill_identifier, context, start_index, confidence, document_id, document_type, source_object, skill_extractor_name)
Trie
Trie(self)
Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern. The corresponding Regex should match much faster than a simple Regex union.
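A compact sketch of the trie-to-regex idea follows. It mirrors the general technique rather than this class's exact code, and `TrieSketch` and its method names are hypothetical. Shared prefixes are factored out, so the regex engine never retries the same prefix for different words.

```python
import re


class TrieSketch:
    def __init__(self):
        self.data = {}

    def add(self, word):
        node = self.data
        for char in word:
            node = node.setdefault(char, {})
        node[""] = True  # end-of-word marker

    def _pattern(self, node):
        if "" in node and len(node) == 1:
            return None  # leaf: nothing follows this character
        alternatives, single_chars, optional = [], [], False
        for char in sorted(node):
            if char == "":
                optional = True  # a word may end here
                continue
            sub = self._pattern(node[char])
            if sub is None:
                single_chars.append(re.escape(char))
            else:
                alternatives.append(re.escape(char) + sub)
        only_chars = not alternatives
        if len(single_chars) == 1:
            alternatives.append(single_chars[0])
        elif single_chars:
            alternatives.append("[" + "".join(single_chars) + "]")
        if len(alternatives) == 1:
            result = alternatives[0]
        else:
            result = "(?:" + "|".join(alternatives) + ")"
        if optional:
            result = result + "?" if only_chars else "(?:" + result + ")?"
        return result

    def pattern(self):
        return self._pattern(self.data)


trie = TrieSketch()
for skill in ["python", "pythons", "sql"]:
    trie.add(skill)
trie_pattern = trie.pattern()
matches = re.findall(r"\b" + trie_pattern + r"\b", "python and sql, not pythonic")
```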
SkillExtractor
SkillExtractor(self, transform_func:Callable=None)
Abstract class for all skill extractors.
All subclasses must implement candidate_skills.
All subclasses must define the properties 'method' (a short machine-readable identifier) and 'description' (a text description of how the extractor does its work).
Args
- transform_func (callable, optional): Function that transforms a structured object into text. Defaults to SimpleCorpusCreator's _join, which takes common text fields in common schema job postings and concatenates them together. For non-job postings another transform function may be needed.
ListBasedSkillExtractor
ListBasedSkillExtractor(self, competency_framework, *args, **kwargs)
Extract skills by comparing with a known lookup/list.
Subclasses must implement _skills_lookup and _document_skills_in_lookup
Args
- skill_lookup_name (string, optional): An identifier for the skill lookup type. Defaults to onet_ksat
- skill_lookup_description (string, optional): A human-readable description of the skill lookup.
skills_ml.algorithms.skill_extractors.exact_match
Use exact matching with a source list to find skills
ExactMatchSkillExtractor
ExactMatchSkillExtractor(self, *args, **kwargs)
Extract skills from unstructured text
Builds a lookup based on the 'name' attribute of all competencies in the given framework
Originally written by Kwame Porter Robinson
skills_ml.algorithms.skill_extractors.fuzzy_match
Use fuzzy matching with a source list to extract skills from unstructured text
FuzzyMatchSkillExtractor
FuzzyMatchSkillExtractor(self, *args, **kwargs)
Extract skills from unstructured text using fuzzy matching
skills_ml.algorithms.skill_extractors.grammar
Use sentence grammar to extract phrases that may be skills
sentences_words_pos
sentences_words_pos(document)
Chops raw text into part-of-speech (POS)-tagged words in sentences
Args
- document (string): A document in text format
- Returns: (list) of sentences, each being a list of word/POS pairs
Example
sentences_words_pos(
'* Develop and maintain relationship with key members of ' +
'ESPN’s Spanish speaking editorial team'
)
[ # list of sentences
[ # list of word/POS pairs
('*', 'NN'),
('Develop', 'NNP'),
('and', 'CC'),
('maintain', 'VB'),
('relationship', 'NN'),
('with', 'IN'),
('key', 'JJ'),
('members', 'NNS'),
('of', 'IN'),
('ESPN', 'NNP'),
('’', 'NNP'),
('s', 'VBD'),
('Spanish', 'JJ'),
('speaking', 'NN'),
('editorial', 'NN'),
('team', 'NN')
]
]
phrases_in_line_with_context
phrases_in_line_with_context(line, parser, target_labels)
Generate phrases in the given line of text
Args
- line (string): A line of raw text
Yields
tuples, each with two strings
- a noun phrase
- the context of the noun phrase (currently defined as the surrounding sentence)
is_bulleted
is_bulleted(string)
Whether or not a given string begins with a 'bullet' character
A bullet character is understood to indicate list membership. Different common bullet characters are checked.
Args
- string (string): Any string
- Returns: (bool) whether or not the string begins with one of the characters in a predefined list of common bullets
clean_beginning
clean_beginning(string)
Clean the beginning of a string of common undesired formatting substrings
Args
- string (string): Any string
- Returns: The string with beginning formatting substrings removed
NPEndPatternExtractor
NPEndPatternExtractor(self, endings, stop_phrases, only_bulleted_lines=True, confidence=95, *args, **kwargs)
Identify noun phrases with certain ending words (e.g. 'skills', 'abilities') as skills
Args
- endings (list): Single words that should identify the ending of a noun phrase as being a skill
- stop_phrases (list): Noun phrases that should not be considered skills
- only_bulleted_lines (bool, default True): Whether or not to only consider lines that look like they are items in a list
SkillEndingPatternExtractor
SkillEndingPatternExtractor(self, *args, **kwargs)
Identify noun phrases ending with 'skill' or 'skills' as skills
AbilityEndingPatternExtractor
AbilityEndingPatternExtractor(self, *args, **kwargs)
Identify noun phrases ending in 'ability' or 'abilities' as skills
skills_ml.algorithms.skill_extractors.noun_phrase_ending
Use noun phrases with specific endings to extract skills from job postings
sentences_words_pos
sentences_words_pos(document)
Chops raw text into part-of-speech (POS)-tagged words in sentences
Args
- document (string): A document in text format
- Returns: (list) of sentences, each being a list of word/POS pairs
Example
sentences_words_pos(
'* Develop and maintain relationship with key members of ' +
'ESPN’s Spanish speaking editorial team'
)
[ # list of sentences
[ # list of word/POS pairs
('*', 'NN'),
('Develop', 'NNP'),
('and', 'CC'),
('maintain', 'VB'),
('relationship', 'NN'),
('with', 'IN'),
('key', 'JJ'),
('members', 'NNS'),
('of', 'IN'),
('ESPN', 'NNP'),
('’', 'NNP'),
('s', 'VBD'),
('Spanish', 'JJ'),
('speaking', 'NN'),
('editorial', 'NN'),
('team', 'NN')
]
]
noun_phrases_in_line_with_context
noun_phrases_in_line_with_context(line)
Generate noun phrases in the given line of text
Args
- line (string): A line of raw text
Yields
tuples, each with two strings
- a noun phrase
- the context of the noun phrase (currently defined as the surrounding sentence)
is_bulleted
is_bulleted(string)
Whether or not a given string begins with a 'bullet' character
A bullet character is understood to indicate list membership. Different common bullet characters are checked.
Args
- string (string): Any string
- Returns: (bool) whether or not the string begins with one of the characters in a predefined list of common bullets
clean_beginning
clean_beginning(string)
Clean the beginning of a string of common undesired formatting substrings
Args
- string (string): Any string
- Returns: The string with beginning formatting substrings removed
NPEndPatternExtractor
NPEndPatternExtractor(self, endings, stop_phrases, only_bulleted_lines=True, confidence=95, *args, **kwargs)
Identify noun phrases with certain ending words (e.g. 'skills', 'abilities') as skills
Args
- endings (list): Single words that should identify the ending of a noun phrase as being a skill
- stop_phrases (list): Noun phrases that should not be considered skills
- only_bulleted_lines (bool, default True): Whether or not to only consider lines that look like they are items in a list
SkillEndingPatternExtractor
SkillEndingPatternExtractor(self, *args, **kwargs)
Identify noun phrases ending with 'skill' or 'skills' as skills
AbilityEndingPatternExtractor
AbilityEndingPatternExtractor(self, *args, **kwargs)
Identify noun phrases ending in 'ability' or 'abilities' as skills
skills_ml.algorithms.skill_extractors.section_extract
SectionExtractSkillExtractor
SectionExtractSkillExtractor(self, section_regex=None, *args, **kwargs)
Extract skills from text by extracting sentences from matching 'sections'.
Heavily utilizes skills_ml.algorithms.nlp.section_extract. For more detail on how to define 'sections', refer to its docstring.
skills_ml.algorithms.skill_extractors.soc_exact
SocScopedExactMatchSkillExtractor
SocScopedExactMatchSkillExtractor(self, competency_ontology, *args, **kwargs)
Extract skills from unstructured text, but only return matches that agree with a known taxonomy
skills_ml.algorithms.skill_extractors.symspell
SymSpell
SymSpell(self, max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)
SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster and language independent. Opposite to other algorithms only deletes are required, no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!
SymSpell supports compound splitting / decompounding of multi-word input strings with three cases
- mistakenly inserted space into a correct word led to two incorrect terms
- mistakenly omitted space between two correct words led to one incorrect combined term
- multiple independent input terms with/without spelling errors
See https://github.com/wolfgarbe/SymSpell for details.
Args
- max_dictionary_edit_distance (int, optional): Maximum distance used to generate index. Also acts as an upper bound for the max_edit_distance parameter in the lookup() method. Defaults to 2.
- prefix_length (int, optional): Prefix length. Should not be changed normally. Defaults to 7.
- count_threshold (int, optional): Threshold corpus-count value for words to be considered correct. Defaults to 1, values below zero are also mapped to 1. Consider setting a higher value if your corpus contains mistakes.
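The core of the symmetric delete idea can be sketched briefly: index every dictionary word under all of its delete variants, then generate only deletes of the input term and look them up. This is a simplified illustration with hypothetical function names. It omits the prefix optimisation, the word counts, and the final check of candidates against the true Damerau-Levenshtein distance that a full implementation needs.

```python
from itertools import combinations


def deletes_sketch(word, max_distance):
    """All strings obtained by deleting up to max_distance characters."""
    variants = {word}
    for distance in range(1, min(max_distance, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - distance):
            variants.add("".join(word[i] for i in kept))
    return variants


def build_index(dictionary_words, max_distance=2):
    # Map every delete variant back to the dictionary words that produce it.
    index = {}
    for entry in dictionary_words:
        for variant in deletes_sketch(entry, max_distance):
            index.setdefault(variant, set()).add(entry)
    return index


def lookup_candidates(term, index, max_distance=2):
    # Only deletes of the input are generated; no inserts or replaces.
    candidates = set()
    for variant in deletes_sketch(term, max_distance):
        candidates |= index.get(variant, set())
    return candidates


index = build_index(["skill", "python"])
transposed = lookup_candidates("pyhton", index)  # transposition of "th"
missing_letter = lookup_candidates("skil", index)
```

A transposition is found because deleting both swapped letters from the input and from the dictionary word yields the same string.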
skills_ml.algorithms.skill_feature_creator
SequenceFeatureCreator
SequenceFeatureCreator(self, job_posting_generator, sentence_tokenizer=None, word_tokenizer=None, features=None, embedding_model=None)
Sequence Feature Creator helps users instantiate different types of features at once and combine them into a sentence (sequence) feature array for sequence modeling. It is a generator that outputs one sentence array at a time. A sentence array is composed of word vectors.
Example
from skills_ml.algorithms.skill_feature_creator import SequenceFeatureCreator
feature_vector_generator = SequenceFeatureCreator(job_posting_generator)
feature_vector_generator = SequenceFeatureCreator(job_posting_generator, features=["StructuralFeature", "EmbeddingFeature"])
Args
- job_posting_generator (generator): job posting generator.
- sentence_tokenizer (func): sentence tokenization function
- word_tokenizer (func): word tokenization function
- features (list): list of feature types to include. If None (the default), all feature types are included.
Yield
sentence_array (numpy.array): an array of word vectors representing the words and punctuation in the sentence. The dimension is (# of words)*(dimension of the concat word vector)
StructuralFeature
StructuralFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)
Structural features
ContextualFeature
ContextualFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)
Contextual features
EmbeddingFeature
EmbeddingFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)
Embedding Feature