[TOC]

skills_ml.evaluation.annotators

BratExperiment

BratExperiment(self, experiment_name, brat_s3_path)

Manage a BRAT experiment. Handles

  1. The creation of BRAT config for a specific sample of job postings
  2. Adding users to the installation and allocating them semi-hidden job postings
  3. The parsing of the annotation results at the end of the experiment

Syncs data to an experiment directory on S3. BRAT installations are expected to sync this data down regularly.

Keeps track of a metadata file, available as a dictionary at self.metadata, with the following structure.

The first five keys simply store user input to either the constructor or start(); see the relevant docstrings for definitions:

  • sample_base_path
  • sample_name
  • entities_with_shortcuts
  • minimum_annotations_per_posting
  • max_postings_per_allocation

The units and allocations keys are far more important when reading the results of an experiment:

```
units: {
    # canonical list of 'unit' (bundle of job postings) names,
    # along with a list of tuples of job posting keys (only unique within unit)
    # and globally unique job posting ids
    'unit_1': [
        (posting_key_1, job_posting_id_1),
        (posting_key_2, job_posting_id_2),
    ],
    'unit_2': [
        (posting_key_1, job_posting_id_3),
        (posting_key_2, job_posting_id_4),
    ]
}
allocations: {
    # canonical list of unit assignments to users
    'user_1': ['unit_1', 'unit_2'],
    'user_2': ['unit_2']
}
```
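As a sketch of how the two keys fit together, the following combines units and allocations to list the globally unique job posting ids assigned to a user. The dict shape follows the structure documented above; the helper function and sample values are illustrative, not part of the library's API.

```python
metadata = {
    "units": {
        "unit_1": [("posting_key_1", "job_posting_id_1"),
                   ("posting_key_2", "job_posting_id_2")],
        "unit_2": [("posting_key_1", "job_posting_id_3"),
                   ("posting_key_2", "job_posting_id_4")],
    },
    "allocations": {
        "user_1": ["unit_1", "unit_2"],
        "user_2": ["unit_2"],
    },
}

def postings_for_user(metadata, user):
    """Flatten a user's allocated units into globally unique job posting ids."""
    return [
        job_posting_id
        for unit_name in metadata["allocations"][user]
        for _posting_key, job_posting_id in metadata["units"][unit_name]
    ]
```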

skills_ml.evaluation.embedding_metrics

CategorizationMetric

CategorizationMetric(self, clustering:skills_ml.ontologies.clustering.Clustering)

The cosine similarity between the concept's vector and the mean vector of all entities within that concept cluster.
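The core computation can be sketched as follows; the function name and vector shapes are illustrative assumptions, not the library's API:

```python
import numpy as np

def categorization_score(concept_vector, entity_vectors):
    """Cosine similarity between a concept vector and the mean of its
    cluster's entity vectors (one entity embedding per row)."""
    mean_vector = np.mean(entity_vectors, axis=0)
    return float(
        np.dot(concept_vector, mean_vector)
        / (np.linalg.norm(concept_vector) * np.linalg.norm(mean_vector))
    )
```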

IntraClusterCohesion

IntraClusterCohesion(self, clustering:skills_ml.ontologies.clustering.Clustering)

The sum of squared errors between the centroid of the concept cluster and each entity within the cluster.
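A minimal sketch of that computation, assuming entity embeddings as rows of a matrix (names are illustrative):

```python
import numpy as np

def intra_cluster_cohesion(entity_vectors):
    """Sum of squared distances from each entity vector to the cluster centroid."""
    centroid = np.mean(entity_vectors, axis=0)
    return float(np.sum((entity_vectors - centroid) ** 2))
```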

RecallTopN

RecallTopN(self, clustering:skills_ml.ontologies.clustering.Clustering, topn=20)

For a given concept cluster and a given number n, find the top n most similar entities from the whole entity pool based on cosine similarity, then calculate the top-n recall: the number of true positives among the top n closest entities divided by the size of the concept cluster.

PrecisionTopN

PrecisionTopN(self, clustering:skills_ml.ontologies.clustering.Clustering, topn=10)

For a given concept cluster and a given number n, find the top n most similar entities from the whole entity pool based on cosine similarity, then calculate the top-n precision: the number of true positives among the top n closest entities divided by n.
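Both top-n metrics share the same ranking step, which can be sketched as below. The function assumes the entity pool is a matrix of embeddings (one row per entity) and the cluster is a set of row indices; these names and shapes are illustrative, not the library's API.

```python
import numpy as np

def top_n_metrics(concept_vector, entity_pool, cluster_indices, n):
    """Rank the whole pool by cosine similarity to the concept vector,
    then compute (recall, precision) against the true cluster members."""
    sims = entity_pool @ concept_vector / (
        np.linalg.norm(entity_pool, axis=1) * np.linalg.norm(concept_vector)
    )
    top_n = set(np.argsort(-sims)[:n].tolist())
    true_positives = len(top_n & cluster_indices)
    recall = true_positives / len(cluster_indices)
    precision = true_positives / n
    return recall, precision
```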

skills_ml.evaluation.job_title_normalizers

Test job normalizers

Requires 'interesting_job_titles.csv' to be populated, with the following columns:

  • input job title
  • description of job
  • ONET code

Each task will output two CSV files: one with the normalizer's ranks and one without. The latter is for sending to people to fill out, and the former is for testing those results against the normalizer's output.

Originally written by Kwame Porter Robinson

InputSchema

InputSchema(self, /, *args, **kwargs)

An enumeration listing the data elements and indices taken from source data

InterimSchema

InterimSchema(self, /, *args, **kwargs)

An enumeration listing the data elements and indices after normalization

NormalizerResponse

NormalizerResponse(self, name=None, access=None, num_examples=3)

Abstract interface enforcing common iteration and access patterns across a variety of possible normalizers.

Args

  • name (string): A name for the normalizer
  • access (filename or file object): A tab-delimited CSV with column order {job_title, description, soc_code}
  • num_examples (int, optional): Number of top responses to include

Normalizers should return a list of results, ordered by relevance, with 'title' and optional 'relevance_score' keys
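A toy normalizer conforming to that contract might look like the following. The class and its behavior are hypothetical illustrations; only the result shape (a relevance-ordered list of dicts with 'title' and optional 'relevance_score' keys) comes from the documentation above.

```python
class TitleCaseNormalizer:
    """Toy normalizer: returns the title-cased input as its single result."""

    def normalize_job_title(self, job_title):
        # A real normalizer would return multiple candidates ordered by relevance
        return [{"title": job_title.strip().title(), "relevance_score": 1.0}]
```

Because it implements normalize_job_title(job_title), a class like this is also the kind of object MiniNormalizer (below) is designed to wrap.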

MiniNormalizer

MiniNormalizer(self, name, access, normalize_class)

Access normalizer classes which can be instantiated and implement 'normalize_job_title(job_title)'

DataAtWorkNormalizer

DataAtWorkNormalizer(self, name=None, access=None, num_examples=3)

skills_ml.evaluation.occ_cls_evaluator

ClassificationEvaluator

ClassificationEvaluator(self, result_generator)

OnetOccupationClassificationEvaluator

OnetOccupationClassificationEvaluator(self, result_generator)

skills_ml.evaluation.representativeness_calculators

Calculate representativeness of a dataset, such as job postings

skills_ml.evaluation.representativeness_calculators.geo_occupation

Computes geographic representativeness of job postings based on ONET SOC Code

GeoOccupationRepresentativenessCalculator

GeoOccupationRepresentativenessCalculator(self, geo_querier=None, normalizer=None)

Calculates geographic representativeness of SOC codes. If a job normalizer is given, it will attempt to compute SOC codes for jobs that are missing them.

Args

  • geo_querier (skills_ml.job_postings.geography_queriers): An object that can return a CBSA from a job posting
  • normalizer (skills_ml.algorithms.occupation_classifiers): An object that can return the SOC code from a job posting

skills_ml.evaluation.skill_extraction_metrics

OntologyCompetencyRecall

OntologyCompetencyRecall(self, ontology:skills_ml.ontologies.base.CompetencyOntology)

The percentage of competencies in an ontology which are present in the candidate skills
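The recall idea here can be sketched as a simple set computation. Matching on exact names is an assumption for illustration; the function name is hypothetical.

```python
def ontology_competency_recall(ontology_competencies, candidate_skill_names):
    """Fraction of the ontology's competencies found among candidate skills."""
    ontology = set(ontology_competencies)
    found = ontology & set(candidate_skill_names)
    return len(found) / len(ontology)
```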

OntologyOccupationRecall

OntologyOccupationRecall(self, ontology:skills_ml.ontologies.base.CompetencyOntology)

The percentage of occupations in the ontology that are present in the candidate skills

MedianSkillsPerDocument

MedianSkillsPerDocument(self, /, *args, **kwargs)

The median number of distinct skills present in each document

SkillsPerDocumentHistogram

SkillsPerDocumentHistogram(self, bins=10, *args, **kwargs)

A histogram of the number of distinct skills present in each document, divided into the configured number of bins

PercentageNoSkillDocuments

PercentageNoSkillDocuments(self, /, *args, **kwargs)

The percentage of documents that contained zero skills

TotalVocabularySize

TotalVocabularySize(self, /, *args, **kwargs)

The total number of skills represented

TotalOccurrences

TotalOccurrences(self, /, *args, **kwargs)

The total number of candidate skill occurrences

EvaluationSetPrecision

EvaluationSetPrecision(self, candidate_skills:Generator[skills_ml.algorithms.skill_extractors.base.CandidateSkill, NoneType, NoneType], evaluation_set_name:str, strict:bool=True)

Find the precision evaluated against an evaluation set of candidate skills.

Args

  • candidate_skills (CandidateSkillYielder): A collection of candidate skills to evaluate against
  • evaluation_set_name (str): A name for the evaluation set of candidate skills. Used in the name of the metric so results from multiple evaluation sets can be compared side-by-side.
  • strict (bool, default True): Whether or not to enforce the exact location of the match, versus just matching between sets on the same skill name and document. Setting this to False will guard against
    1. labelers who don't mark every instance of a skill once they have found one instance
    2. discrepancies in start_index values caused by errant transformation methods

    However, this could also produce false matches, so use with care.
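The strict/non-strict distinction above can be sketched as a choice of match key: a candidate skill is keyed either by (document, skill name, start index) or, when strict is False, by (document, skill name) alone. The field names and functions below are assumptions for illustration, not the library's API.

```python
def match_key(candidate, strict=True):
    """Key a candidate skill for set-based matching (fields are assumed names)."""
    key = (candidate["document_id"], candidate["skill_name"].lower())
    if strict:
        key += (candidate["start_index"],)
    return key

def precision(candidates, evaluation_set, strict=True):
    """Fraction of candidate skills whose key appears in the evaluation set."""
    truth = {match_key(cs, strict) for cs in evaluation_set}
    if not candidates:
        return 0.0
    hits = sum(1 for cs in candidates if match_key(cs, strict) in truth)
    return hits / len(candidates)
```

EvaluationSetRecall (below) applies the same matching, but divides by the size of the evaluation set instead of the number of candidates.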

EvaluationSetRecall

EvaluationSetRecall(self, candidate_skills, evaluation_set_name, strict=True)

Find the recall evaluated against an evaluation set of candidate skills.

Args

  • candidate_skills (CandidateSkillYielder): A collection of candidate skills to evaluate against
  • evaluation_set_name (str): A name for the evaluation set of candidate skills. Used in the name of the metric so results from multiple evaluation sets can be compared side-by-side.
  • strict (bool, default True): Whether or not to enforce the exact location of the match, versus just matching between sets on the same skill name and document. Setting this to False will guard against
    1. labelers who don't mark every instance of a skill once they have found one instance
    2. discrepancies in start_index values caused by errant transformation methods

    However, this could also produce false matches, so use with care.

skills_ml.evaluation.skill_extractors