[TOC]

skills_ml.algorithms.embedding

skills_ml.algorithms.embedding.base

skills_ml.algorithms.embedding.models

Embedding model classes inheriting the interface from gensim

Word2VecModel

Word2VecModel(self, model_name=None, storage=None, *args, **kwargs)

The Word2VecModel class inherits from gensim's Word2Vec model (https://radimrehurek.com/gensim/models/word2vec.html) and extends it with methods for training, using, and evaluating word embeddings.

Example

```python
from skills_ml.algorithms.embedding.models import Word2VecModel

word2vec_model = Word2VecModel()
```

Doc2VecModel

Doc2VecModel(self, model_name=None, storage=None, *args, **kwargs)

The Doc2VecModel class inherits from gensim's Doc2Vec model (https://radimrehurek.com/gensim/models/doc2vec.html) and extends it with methods for training, using, and evaluating document embeddings.

Example

```python
from skills_ml.algorithms.embedding.models import Doc2VecModel

doc2vec_model = Doc2VecModel()
```

FastTextModel

FastTextModel(self, model_name=None, storage=None, *args, **kwargs)

The FastTextModel class inherits from gensim's FastText model (https://radimrehurek.com/gensim/models/fasttext.html) and extends it with methods for training, using, and evaluating word embeddings.

Example

```python
from skills_ml.algorithms.embedding.models import FastTextModel

fasttext = FastTextModel()
```

skills_ml.algorithms.embedding.train

EmbeddingTrainer

EmbeddingTrainer(self, *models, model_storage=None, batch_size=2000)

An embedding learning class.

Example

```python
from skills_ml.algorithms.embedding.train import EmbeddingTrainer
from skills_ml.algorithms.embedding.models import Word2VecModel
from skills_ml.job_postings.common_schema import JobPostingCollectionSample
from skills_ml.job_postings.corpora.basic import Word2VecGensimCorpusCreator
from skills_ml.storage import FSStore

job_postings_generator = JobPostingCollectionSample()
corpus_generator = Word2VecGensimCorpusCreator(job_postings_generator)
w2v = Word2VecModel(storage=FSStore(path='/tmp'), size=10, min_count=3, iter=4, window=6, workers=3)
trainer = EmbeddingTrainer(w2v)
trainer.train(corpus_generator)
trainer.save_model()
```

skills_ml.algorithms.geocoders

Geocoders, with caching and throttling

CachedGeocoder

CachedGeocoder(self, cache_storage, cache_fname, geocode_func=osm, sleep_time=1, autosave=True)

Geocoder that uses specified storage as a cache.

Args

cache_storage (object) FSStore() or S3Store object to store the cache
cache_fname (string) cache file name
geocode_func (function) a function that geocodes a given search string
    defaults to the OSM geocoder provided by the geocode library
sleep_time (int) The time, in seconds, between geocode calls
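A minimal construction sketch based on the signature above; only the constructor is documented in this section, so lookup methods are omitted. FSStore and the cache file name are illustrative choices.

```python
from skills_ml.algorithms.geocoders import CachedGeocoder
from skills_ml.storage import FSStore

# Cache geocode results on the local filesystem, pausing one second
# between calls to the underlying OSM geocoder.
geocoder = CachedGeocoder(
    cache_storage=FSStore(path='/tmp'),
    cache_fname='geocodes.json',
    sleep_time=1,
)
```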

skills_ml.algorithms.geocoders.cbsa

Given geocode results, find matching Core-Based Statistical Areas.

Match

Match(self, /, *args, **kwargs)

Match(index, area)

CachedCBSAFinder

CachedCBSAFinder(self, cache_storage, cache_fname, shapefile_name=None, cache_dir=None)

Find CBSAs associated with geocode results and save them to the specified storage

Geocode results are expected in the JSON format provided by the python geocoder module, with a 'bbox' key.

The highest-level interface is the 'find_all_cbsas_and_save' method, which provides storage caching. A minimal call looks like

```python
cache_storage = S3Store('some-bucket')
cache_fname = 'cbsas.json'
cbsa_finder = CachedCBSAFinder(cache_storage=cache_storage, cache_fname=cache_fname)
cbsa_finder.find_all_cbsas_and_save({
    "Flushing, NY": {'bbox': {'southwest': [..., ...], 'northeast': [..., ...]}},
    "Houston, TX": {'bbox': {'southwest': [..., ...], 'northeast': [..., ...]}},
})

# This usage of 'bbox' is what you can retrieve from a `geocoder` call, such as:
geocoder.osm('Flushing, NY').json()
```

The keys in the resulting cache will be the original search strings.

Warning: The caching is not parallel-safe! It is recommended that you run only one copy of find_all_cbsas_and_save at a time to avoid overwriting the cache file.

Args

cache_storage (object) FSStore() or S3Store object to store the cache
cache_fname (string) cache file name
shapefile_name (string) local path to a CBSA shapefile to use
    optional, will download TIGER 2015 shapefile if absent
cache_dir (string) local path to a cache directory to use if the
    shapefile needs to be downloaded
    optional, will use 'tmp' in working directory if absent

skills_ml.algorithms.job_normalizers

Algorithms to normalize a job title to a smaller space

skills_ml.algorithms.job_normalizers.elasticsearch

Indexes job postings for job title normalization

NormalizeTopNIndexer

NormalizeTopNIndexer(self, quarter, job_postings_generator, job_titles_index, alias_name, **kwargs)

Creates an index that stores data for job title normalization.

Depends on a previously created index with job titles and occupations.

Queries the job title/occupation index for:

1. job titles or occupations that match the job description
2. occupation matches

The top three results are indexed.

Args

quarter (string) the quarter from which to retrieve job postings
job_postings_generator (iterable) an iterable of job postings
job_titles_index (string) The name of an already existing job title/occupation index

skills_ml.algorithms.job_normalizers.esa_jobtitle_normalizer

Normalize a job title through Explicit Semantic Analysis

Originally written by Kwame Porter Robinson

ESANormalizer

ESANormalizer(self, onet_source=OnetToDiskDownloader)

Normalize a job title to ONET occupation titles using explicit semantic analysis.

Uses ONET occupation titles and descriptions.

skills_ml.algorithms.jobtitle_cleaner

Clean job titles

skills_ml.algorithms.jobtitle_cleaner.clean

Clean job titles by utilizing a list of stopwords

clean_by_rules

clean_by_rules(jobtitle)

Remove numbers and normalize spaces

Args

jobtitle (string) A string
  • Returns: (string) the string with numbers removed and spaces normalized

clean_by_neg_dic

clean_by_neg_dic(jobtitle, negative_list, positive_list)

Remove words from the negative dictionary

Args

jobtitle (string) A job title string
negative_list (collection) A list of stop words
positive_list (collection) A list of positive words to override stop words
  • Returns: (string) The cleaned job title
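A short sketch of the two cleaners together; the job title and word lists below are made-up inputs, not the shipped dictionaries.

```python
from skills_ml.algorithms.jobtitle_cleaner.clean import clean_by_rules, clean_by_neg_dic

title = clean_by_rules('Registered Nurse 2')  # numbers removed, spaces normalized
cleaned = clean_by_neg_dic(
    title,
    negative_list=['part', 'time'],  # hypothetical stop words
    positive_list=['nurse'],         # hypothetical overrides
)
```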

aggregate

aggregate(df_jobtitles, groupby_keys)

Args

  • df_jobtitles: job titles in a pandas DataFrame
  • groupby_keys: a list of keys to be grouped by, e.g. ['title', 'geo']

Returns

agg_cleaned_jobtitles: an aggregated version of the job titles in a pandas DataFrame

JobTitleStringClean

JobTitleStringClean(self)

Clean job titles by stripping numbers, and removing place/state names (unless they are also ONET jobs)

skills_ml.algorithms.nlp

String transformations for text cleaning. For unicodedata general category values, see

http://www.unicode.org/reports/tr44/tr44-4.html#General_Category_Values

deep

deep(func)

A decorator that will apply a function to a nested list recursively

Args

  • func (function): a function to be applied to a nested list

Returns

function: The wrapped function
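A toy sketch of the decorator; the commented output is the expected behavior, not a captured run.

```python
from skills_ml.algorithms.nlp import deep

@deep
def lowercase(text):
    return text.lower()

# Applies the wrapped function to every element of the nested list:
# [['software engineer', 'data analyst'], ['registered nurse']]
lowercase([['Software Engineer', 'DATA Analyst'], ['Registered NURSE']])
```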

normalize

normalize(text:str) -> str

Args

  • text (str): A unicode string

Returns

str: The text, lowercased and in NFKD normal form
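For instance, with a hypothetical input containing fullwidth characters, which NFKD maps to their ASCII equivalents:

```python
from skills_ml.algorithms.nlp import normalize

normalize('Ｓｏｆｔｗａｒｅ Ｅｎｇｉｎｅｅｒ')  # -> 'software engineer'
```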

lowercase_strip_punc

lowercase_strip_punc(text:str, punct:Set[str]=None) -> str

Args

  • text (str): A unicode string
  • punct (:obj: set, optional)

Returns

str: The text, lowercased, sans punctuation and in NFKD normal form

title_phase_one

title_phase_one(text:str, punct:Set[str]=None) -> str

Args

  • text (str): A unicode string
  • punct (:obj: set, optional)

Returns

str: The text, lowercased, sans punctuation, whitespace normalized

clean_str

clean_str(text:str) -> str

Args

  • text: A unicode string

Returns

str: the text lowercased, with punctuation and non-English letters removed

sentence_tokenize

sentence_tokenize(text:str) -> List[str]

Args

  • text (str): a unicode string

Returns

list: tokenized sentences

word_tokenize

word_tokenize(text:str, punctuation=True) -> List[str]

Args

  • text (str): a unicode string

Returns

list: tokenized words
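A combined sketch of the two tokenizers; the commented outputs are indicative, since exact token boundaries depend on the underlying tokenizer.

```python
from skills_ml.algorithms.nlp import sentence_tokenize, word_tokenize

sentences = sentence_tokenize('We are hiring. Apply today!')
# e.g. ['We are hiring.', 'Apply today!']
words = word_tokenize(sentences[0])
# e.g. ['We', 'are', 'hiring', '.']
```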

fields_join

fields_join(document:Dict, document_schema_fields:List[str]=None) -> str

Args

  • document (dict): a document dictionary
  • document_schema_fields (:obj: list, optional): a list of keys

Returns

str: a single text string joined from the selected fields.
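For example, with a hypothetical common-schema document:

```python
from skills_ml.algorithms.nlp import fields_join

document = {'title': 'Data Analyst', 'description': 'Analyze data.'}
fields_join(document, document_schema_fields=['description'])  # -> 'Analyze data.'
```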

vectorize

vectorize(tokenized_text:List[str], embedding_model)

Args

  • tokenized_text: a list of word tokens
  • embedding_model: an embedding model that implements the .infer_vector() method

Returns

np.ndarray: a word embedding vector
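A short sketch; `w2v` stands in for any trained embedding model that exposes .infer_vector(), such as a Word2VecModel from this package.

```python
from skills_ml.algorithms.nlp import word_tokenize, vectorize

# w2v: a trained embedding model with an .infer_vector() method (assumed)
vector = vectorize(word_tokenize('excellent communication skills'), embedding_model=w2v)
```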

section_extract

section_extract(section_regex:Pattern[~AnyStr], document:str) -> List

Only return the contents of the configured section heading

Defines a 'heading' as the text of a sentence that

- does not itself start with a bullet character
- either has between 1 and 3 words or ends in a colon

For a heading that matches the given pattern, returns each sentence between it and the next heading.

Heavily relies on the fact that sentence_tokenize does line splitting as well as standard sentence tokenization. In this way, it should work both for text strings that have newlines and for text strings that don't.

In addition, this function splits each sentence by bullet characters as often bullets denote what we want to call 'sentences', but authors often take advantage of the bullet characters to make the contents of each 'sentence' into small sentence fragments, which makes standard sentence tokenization insufficient if the newlines have been taken out.
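A usage sketch under the rules above, assuming 'Qualifications:' qualifies as a heading (it ends in a colon) and matches the supplied pattern; the commented result is indicative only.

```python
import re
from skills_ml.algorithms.nlp import section_extract

document = (
    'Qualifications:\n'
    '- Excellent writing skills\n'
    '- Attention to detail\n'
    'Benefits:\n'
    '- Health insurance\n'
)
# Expected to return the sentences under the matching heading,
# e.g. the two bulleted qualification lines
section_extract(re.compile('Qualifications'), document)
```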

split_by_bullets

split_by_bullets(sentence:str) -> List

Split sentence by bullet characters

strip_bullets_from_line

strip_bullets_from_line(line:str) -> str

Remove bullets from beginning of line

skills_ml.algorithms.occupation_classifiers

SOCMajorGroup

SOCMajorGroup(self, filters=None)

FullSOC

FullSOC(self, filters=None, onet_cache=None)

skills_ml.algorithms.occupation_classifiers.classifiers

SocClassifier

SocClassifier(self, classifier)

Interface for SOC code classifiers, wrapping a concrete classifier behind a common API.

KNNDoc2VecClassifier

KNNDoc2VecClassifier(self, embedding_model, k=1, indexer=None, model_name=None, model_storage=None, **kwargs)

Nearest neighbors model to classify job posting data into SOC codes. If an indexer is passed, NearestNeighbors will use an approximate nearest neighbor approach, which is much faster than the built-in kNN in gensim.

Attributes

  • embedding_model (:obj: skills_ml.algorithms.embedding.models.Doc2VecModel): Doc2Vec embedding model
  • k (int): number of nearest neighbors. If k = 1, take the SOC code of the single nearest neighbor. If k > 1, classify by majority vote of the k nearest neighbors.
  • indexer (:obj: gensim.similarities.index): any kind of gensim compatible indexer
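A construction sketch based on the attributes above; `doc2vec_model` is assumed to be a trained Doc2VecModel, and the prediction interface is not documented in this section.

```python
from skills_ml.algorithms.occupation_classifiers.classifiers import KNNDoc2VecClassifier

# doc2vec_model: a trained skills_ml Doc2VecModel (assumed)
knn = KNNDoc2VecClassifier(embedding_model=doc2vec_model, k=5)
```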

skills_ml.algorithms.occupation_classifiers.test

skills_ml.algorithms.occupation_classifiers.train

OccupationClassifierTrainer

OccupationClassifierTrainer(self, matrix, k_folds, grid_config=None, storage=None, random_state_for_split=None, scoring=['accuracy'], n_jobs=3)

Trains a series of classifiers using the same training set

Args

  • matrix (skills_ml.algorithms.train.matrix): a matrix object that holds X, y and other training data information
  • storage (skills_ml.storage): a skills_ml storage object that specifies the storage method
  • k_folds (int): number of folds for cross validation
  • random_state_for_split (int): random state
  • n_jobs (int): number of jobs to run in parallel

skills_ml.algorithms.preprocessing

ProcessingPipeline

ProcessingPipeline(self, *functions:Callable)

A simple callable processing pipeline for imperative execution runtime.

This class will compose processing functions together to become a callable object that takes in the input from the very first processing function and returns the output of the last processing function.

Example

This class can be used to create a callable vectorization object which
will transform a string into a vector and also preserve the preprocessing
functions for being reused later.
```python
from functools import partial
from skills_ml.algorithms.nlp import normalize, clean_html, clean_str, word_tokenize, vectorize
from skills_ml.algorithms.preprocessing import ProcessingPipeline

# w2v: a trained embedding model, e.g. a Word2VecModel from this package
vectorization = ProcessingPipeline(
    normalize,
    clean_html,
    clean_str,
    word_tokenize,
    partial(vectorize, embedding_model=w2v)
)

vector = vectorization("Why so serious?")
```

Attributes

  • functions (generator): a series of functions

IterablePipeline

IterablePipeline(self, *functions:Callable)

A simple iterable processing pipeline.

This class composes processing functions together so they can be passed to different stages (training/prediction), ensuring the same processing procedures.

Example

```python
from functools import partial
from skills_ml.algorithms.nlp import fields_join, clean_html, sentence_tokenize, clean_str, word_tokenize
from skills_ml.algorithms.preprocessing import IterablePipeline
from skills_ml.job_postings.common_schema import JobPostingCollectionSample

jp = JobPostingCollectionSample()
pipe = IterablePipeline(
    partial(fields_join, document_schema_fields=['description']),
    clean_html,
    sentence_tokenize,
    clean_str,
    word_tokenize
)
preprocessed_generator = pipe(jp)
```

Attributes

  • functions (generator): a series of generator functions that each take another generator as input

func2gen

func2gen(func:Callable) -> Callable

A wrapper that changes a document-transforming function, which takes a single document as input, into a function that takes a generator/iterator as input. When called, it returns a generator.

Example

```python
@func2gen
def do_something(doc):
    return do_something_to_the_doc(doc)
```

Args

  • func (function): a function that takes a single document as its first argument.

Returns

func (function): a function that takes a generator as its first argument.

skills_ml.algorithms.sampling

Generate and store samples of datasets

skills_ml.algorithms.sampling.methods

Generic sampling methods

reservoir

reservoir(it, k)

Reservoir sampling with random sort from a job posting iterator

Randomly chooses a sample of k items from a streaming iterator, implemented with random sort: assign a random number to each item as a key and keep the k items with the smallest keys. This is equivalent to sorting all items by random keys and taking the top k.

Args

  • it (iterator): Job posting iterator to sample from
  • k (int): Sample size

Returns

generator: The resulting sample of k items.
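For example, a plain range can stand in for the job posting iterator:

```python
from skills_ml.algorithms.sampling.methods import reservoir

sample = list(reservoir(iter(range(1000)), 10))  # 10 items chosen uniformly at random
```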

reservoir_weighted

reservoir_weighted(it, k, weights, key)

Weighted reservoir sampling from a job posting iterator

Randomly chooses a sample of k items from a streaming iterator based on the weights.

Args

  • it (iterator): Job posting iterator to sample from. The items should have the format (job_posting, label)
  • k (int): Sample size
  • weights (dict): a dictionary of label-weight pairs. It expects every label in the iterator to be present as a key in the weights dictionary. For example, weights = {'11': 2, '13': 1}: the label/key is the occupation major group and the value is the weight you want to sample with.

Returns

generator: The resulting sample of k items from weighted reservoir sampling.
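A sketch under the documented signature; the `key` argument is assumed to map an item to its label.

```python
from skills_ml.algorithms.sampling.methods import reservoir_weighted

stream = iter([('posting-a', '11'), ('posting-b', '13'), ('posting-c', '11')])
sample = list(reservoir_weighted(
    stream,
    k=2,
    weights={'11': 2, '13': 1},  # sample major group 11 at twice the weight of 13
    key=lambda item: item[1],    # assumed: extracts the label from (job_posting, label)
))
```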

skills_ml.algorithms.skill_extractors

Extract skills from text corpora, such as job postings

skills_ml.algorithms.skill_extractors.base

Base classes for skill extraction

CandidateSkill

CandidateSkill(self, /, *args, **kwargs)

CandidateSkill(skill_name, matched_skill_identifier, context, start_index, confidence, document_id, document_type, source_object, skill_extractor_name)

Trie

Trie(self)

Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern. The corresponding Regex should match much faster than a simple Regex union.

SkillExtractor

SkillExtractor(self, transform_func:Callable=None)

Abstract class for all skill extractors.

All subclasses must implement candidate_skills.

All subclasses must define the properties 'method' (a short machine-readable name) and 'description' (a text description of how the extractor does its work).

Args

transform_func (callable, optional) Function that transforms a structured object into text
    Defaults to SimpleCorpusCreator's _join, which takes common text fields
    in common schema job postings and concatenates them together.
    For non-job postings another transform function may be needed.

ListBasedSkillExtractor

ListBasedSkillExtractor(self, competency_framework, *args, **kwargs)

Extract skills by comparing with a known lookup/list.

Subclasses must implement _skills_lookup and _document_skills_in_lookup

Args

skill_lookup_name (string, optional) An identifier for the skill lookup type.
    Defaults to onet_ksat
skill_lookup_description (string, optional) A human-readable description of the skill lookup.

skills_ml.algorithms.skill_extractors.exact_match

Use exact matching with a source list to find skills

ExactMatchSkillExtractor

ExactMatchSkillExtractor(self, *args, **kwargs)

Extract skills from unstructured text

Builds a lookup based on the 'name' attribute of all competencies in the given framework

Originally written by Kwame Porter Robinson
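A usage sketch; `competency_framework` and `job_posting` are assumed inputs: a framework whose competencies carry 'name' attributes and a common-schema job posting.

```python
from skills_ml.algorithms.skill_extractors.exact_match import ExactMatchSkillExtractor

extractor = ExactMatchSkillExtractor(competency_framework)  # assumed framework object
for candidate_skill in extractor.candidate_skills(job_posting):  # assumed posting
    print(candidate_skill.skill_name, '|', candidate_skill.context)
```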

skills_ml.algorithms.skill_extractors.fuzzy_match

Use fuzzy matching with a source list to extract skills from unstructured text

FuzzyMatchSkillExtractor

FuzzyMatchSkillExtractor(self, *args, **kwargs)

Extract skills from unstructured text using fuzzy matching

skills_ml.algorithms.skill_extractors.grammar

Use sentence grammar to extract phrases that may be skills

sentences_words_pos

sentences_words_pos(document)

Chops raw text into part-of-speech (POS)-tagged words in sentences

Args

document (string) A document in text format
  • Returns: (list) of sentences, each being a list of (word, POS) pairs

Example

```python
sentences_words_pos(
    '* Develop and maintain relationship with key members of ' +
    'ESPN’s Spanish speaking editorial team'
)
[ # list of sentences
    [ # list of word/POS pairs
        ('*', 'NN'),
        ('Develop', 'NNP'),
        ('and', 'CC'),
        ('maintain', 'VB'),
        ('relationship', 'NN'),
        ('with', 'IN'),
        ('key', 'JJ'),
        ('members', 'NNS'),
        ('of', 'IN'),
        ('ESPN', 'NNP'),
        ('’', 'NNP'),
        ('s', 'VBD'),
        ('Spanish', 'JJ'),
        ('speaking', 'NN'),
        ('editorial', 'NN'),
        ('team', 'NN')
    ]
]
```

phrases_in_line_with_context

phrases_in_line_with_context(line, parser, target_labels)

Generate phrases in the given line of text

Args

  • line (string): A line of raw text

Yields

tuples, each with two strings

    - a noun phrase
    - the context of the noun phrase (currently defined as the surrounding sentence)

is_bulleted

is_bulleted(string)

Whether or not a given string begins with a 'bullet' character

A bullet character is understood to indicate list membership. Several common bullet characters are checked.

Args

  • string (string): Any string

  • Returns: (bool) whether or not the string begins with one of the characters in a predefined list of common bullets

clean_beginning

clean_beginning(string)

Clean the beginning of a string of common undesired formatting substrings

Args

  • string (string): Any string

  • Returns: The string with beginning formatting substrings removed

NPEndPatternExtractor

NPEndPatternExtractor(self, endings, stop_phrases, only_bulleted_lines=True, confidence=95, *args, **kwargs)

Identify noun phrases with certain ending words (e.g. 'skills', 'abilities') as skills

Args

  • endings (list): Single words that should identify the ending of a noun phrase as being a skill
  • stop_phrases (list): Noun phrases that should not be considered skills
  • only_bulleted_lines (bool, default True): Whether or not to only consider lines that look like they are items in a list

SkillEndingPatternExtractor

SkillEndingPatternExtractor(self, *args, **kwargs)

Identify noun phrases ending with 'skill' or 'skills' as skills

AbilityEndingPatternExtractor

AbilityEndingPatternExtractor(self, *args, **kwargs)

Identify noun phrases ending in 'ability' or 'abilities' as skills
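A usage sketch for the pattern extractors; `job_posting` is an assumed common-schema posting, and candidate_skills is the interface shared by all skill extractors.

```python
from skills_ml.algorithms.skill_extractors.grammar import SkillEndingPatternExtractor

# Consider all lines rather than only bulleted ones
extractor = SkillEndingPatternExtractor(only_bulleted_lines=False)
for candidate_skill in extractor.candidate_skills(job_posting):  # assumed posting
    print(candidate_skill.skill_name)
```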

skills_ml.algorithms.skill_extractors.noun_phrase_ending

Use noun phrases with specific endings to extract skills from job postings

sentences_words_pos

sentences_words_pos(document)

Chops raw text into part-of-speech (POS)-tagged words in sentences

Args

document (string) A document in text format
  • Returns: (list) of sentences, each being a list of (word, POS) pairs

Example

```python
sentences_words_pos(
    '* Develop and maintain relationship with key members of ' +
    'ESPN’s Spanish speaking editorial team'
)
[ # list of sentences
    [ # list of word/POS pairs
        ('*', 'NN'),
        ('Develop', 'NNP'),
        ('and', 'CC'),
        ('maintain', 'VB'),
        ('relationship', 'NN'),
        ('with', 'IN'),
        ('key', 'JJ'),
        ('members', 'NNS'),
        ('of', 'IN'),
        ('ESPN', 'NNP'),
        ('’', 'NNP'),
        ('s', 'VBD'),
        ('Spanish', 'JJ'),
        ('speaking', 'NN'),
        ('editorial', 'NN'),
        ('team', 'NN')
    ]
]
```

noun_phrases_in_line_with_context

noun_phrases_in_line_with_context(line)

Generate noun phrases in the given line of text

Args

  • line (string): A line of raw text

Yields

tuples, each with two strings

    - a noun phrase
    - the context of the noun phrase (currently defined as the surrounding sentence)

is_bulleted

is_bulleted(string)

Whether or not a given string begins with a 'bullet' character

A bullet character is understood to indicate list membership. Several common bullet characters are checked.

Args

  • string (string): Any string

  • Returns: (bool) whether or not the string begins with one of the characters in a predefined list of common bullets

clean_beginning

clean_beginning(string)

Clean the beginning of a string of common undesired formatting substrings

Args

  • string (string): Any string

  • Returns: The string with beginning formatting substrings removed

NPEndPatternExtractor

NPEndPatternExtractor(self, endings, stop_phrases, only_bulleted_lines=True, confidence=95, *args, **kwargs)

Identify noun phrases with certain ending words (e.g. 'skills', 'abilities') as skills

Args

  • endings (list): Single words that should identify the ending of a noun phrase as being a skill
  • stop_phrases (list): Noun phrases that should not be considered skills
  • only_bulleted_lines (bool, default True): Whether or not to only consider lines that look like they are items in a list

SkillEndingPatternExtractor

SkillEndingPatternExtractor(self, *args, **kwargs)

Identify noun phrases ending with 'skill' or 'skills' as skills

AbilityEndingPatternExtractor

AbilityEndingPatternExtractor(self, *args, **kwargs)

Identify noun phrases ending in 'ability' or 'abilities' as skills

skills_ml.algorithms.skill_extractors.section_extract

SectionExtractSkillExtractor

SectionExtractSkillExtractor(self, section_regex=None, *args, **kwargs)

Extract skills from text by extracting sentences from matching 'sections'.

Heavily utilizes skills_ml.algorithms.nlp.section_extract. For more detail on how to define 'sections', refer to its docstring.

skills_ml.algorithms.skill_extractors.soc_exact

SocScopedExactMatchSkillExtractor

SocScopedExactMatchSkillExtractor(self, competency_ontology, *args, **kwargs)

Extract skills from unstructured text, but only return matches that agree with a known taxonomy

skills_ml.algorithms.skill_extractors.symspell

SymSpell

SymSpell(self, max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)

SymSpell: 1 million times faster spelling correction through the Symmetric Delete algorithm.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster and language independent. In contrast to other algorithms, only deletes are required; transposes, replaces and inserts of the input term are transformed into deletes of the dictionary term. Replaces and inserts are expensive and language dependent (e.g. Chinese has 70,000 Unicode Han characters!).

SymSpell supports compound splitting / decompounding of multi-word input strings with three cases:

  1. a mistakenly inserted space within a correct word, producing two incorrect terms
  2. a mistakenly omitted space between two correct words, producing one incorrect combined term
  3. multiple independent input terms, with or without spelling errors

See https://github.com/wolfgarbe/SymSpell for details.

Args

  • max_dictionary_edit_distance (int, optional): Maximum distance used to generate the index. Also acts as an upper bound for the max_edit_distance parameter in the lookup() method. Defaults to 2.
  • prefix_length (int, optional): Prefix length. Should not normally be changed. Defaults to 7.
  • count_threshold (int, optional): Threshold corpus-count value for words to be considered correct. Defaults to 1; values below zero are also mapped to 1. Consider setting a higher value if your corpus contains mistakes.
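A construction sketch; beyond the constructor, this section only mentions that a lookup() method exists, so dictionary loading is omitted and the lookup call is shown as a comment.

```python
from skills_ml.algorithms.skill_extractors.symspell import SymSpell

sym = SymSpell(max_dictionary_edit_distance=2, prefix_length=7, count_threshold=1)
# After a frequency dictionary has been loaded (loading API not documented here),
# lookup() returns suggested corrections within max_edit_distance, e.g.:
# suggestions = sym.lookup('programing', max_edit_distance=2)
```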

skills_ml.algorithms.skill_feature_creator

SequenceFeatureCreator

SequenceFeatureCreator(self, job_posting_generator, sentence_tokenizer=None, word_tokenizer=None, features=None, embedding_model=None)

Sequence Feature Creator helps users instantiate different feature types at once and combine them into a sentence (sequence) feature array for sequence modeling. It is a generator that outputs one sentence array at a time; a sentence array is composed of word vectors.

Example

```python
from skills_ml.algorithms.skill_feature_creator import SequenceFeatureCreator

feature_vector_generator = SequenceFeatureCreator(job_posting_generator)
feature_vector_generator = SequenceFeatureCreator(job_posting_generator, features=["StructuralFeature", "EmbeddingFeature"])
```

Args

  • job_posting_generator (generator): job posting generator.
  • sentence_tokenizer (func): sentence tokenization function
  • word_tokenizer (func): word tokenization function
  • features (list): list of feature types one wants to include. If None (the default), all feature types are included.

Yield

sentence_array (numpy.array): an array of word vectors representing the words and punctuation in the sentence. The dimension is (# of words) × (dimension of the concatenated word vector).

StructuralFeature

StructuralFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)

Structural features

ContextualFeature

ContextualFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)

Contextual features

EmbeddingFeature

EmbeddingFeature(self, sentence_tokenizer=None, word_tokenizer=None, **kwargs)

Embedding features

skills_ml.algorithms.skill_feature_creator.contextual_features

skills_ml.algorithms.skill_feature_creator.posTags

skills_ml.algorithms.skill_feature_creator.structure_features