[TOC]
skills_ml.evaluation.annotators
BratExperiment
BratExperiment(self, experiment_name, brat_s3_path)
Manage a BRAT experiment. Handles:
- The creation of BRAT config for a specific sample of job postings
- Adding users to the installation and allocating them semi-hidden job postings
- The parsing of the annotation results at the end of the experiment
Syncs data to an experiment directory on S3. BRAT installations are expected to sync this data down regularly.
Keeps track of a metadata file, available as a dictionary at self.metadata, with the following structure:

```
{
    # these first five keys are just storage of user input to either
    # the constructor or start(); view relevant docstrings for definitions
    'sample_base_path': ...,
    'sample_name': ...,
    'entities_with_shortcuts': ...,
    'minimum_annotations_per_posting': ...,
    'max_postings_per_allocation': ...,

    # units and allocations are far more important when reading
    # the results of an experiment
    'units': {
        # canonical list of 'unit' (bundle of job postings) names,
        # along with a list of tuples of job posting keys
        # (only unique within unit) and globally unique job posting ids
        'unit_1': [
            (posting_key_1, job_posting_id_1),
            (posting_key_2, job_posting_id_2),
        ],
        'unit_2': [
            (posting_key_1, job_posting_id_3),
            (posting_key_2, job_posting_id_4),
        ]
    },
    'allocations': {
        # canonical list of unit assignments to users
        'user_1': ['unit_1', 'unit_2'],
        'user_2': ['unit_2']
    }
}
```
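As a hedged illustration of how this fits together (the experiment name and S3 path below are hypothetical), the constructor and the metadata dictionary can be used like this:

```python
from skills_ml.evaluation.annotators import BratExperiment

experiment = BratExperiment(
    experiment_name='skill_annotation_pilot',   # hypothetical name
    brat_s3_path='my-bucket/brat-experiments',  # hypothetical S3 path
)

# Once start() has populated the experiment, the unit and allocation
# bookkeeping described above is readable from the metadata dictionary
for user, units in experiment.metadata['allocations'].items():
    print(user, units)
```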
skills_ml.evaluation.embedding_metrics
CategorizationMetric
CategorizationMetric(self, clustering:skills_ml.ontologies.clustering.Clustering)
The cosine similarity between a concept cluster's vector and the mean vector of all entities within that concept cluster.
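A minimal numpy sketch of this computation (the vectors are made-up illustrations, not the metric's internal code):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

concept_vector = np.array([0.2, 0.7, 0.1])   # vector for the concept itself
entity_vectors = np.array([[0.1, 0.8, 0.1],  # vectors for entities in the cluster
                           [0.3, 0.6, 0.2]])

# compare the concept against the mean of its cluster's entity vectors
score = cosine_similarity(concept_vector, entity_vectors.mean(axis=0))
```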
IntraClusterCohesion
IntraClusterCohesion(self, clustering:skills_ml.ontologies.clustering.Clustering)
The sum of squared errors between the centroid of the concept cluster and each entity within the cluster.
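A sketch of the same quantity in numpy (illustrative data, not the class's implementation):

```python
import numpy as np

entity_vectors = np.array([[0.1, 0.8, 0.1],
                           [0.3, 0.6, 0.2],
                           [0.2, 0.7, 0.3]])
centroid = entity_vectors.mean(axis=0)

# sum of squared errors between each entity and the cluster centroid
cohesion = float(((entity_vectors - centroid) ** 2).sum())
```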
RecallTopN
RecallTopN(self, clustering:skills_ml.ontologies.clustering.Clustering, topn=20)
For a given concept cluster and a given number n, find the top n most similar entities from the whole entity pool based on cosine similarity, then calculate the top-n recall: the number of true positives among the n closest entities divided by the total number of entities in the concept cluster. (A sketch covering both this metric and PrecisionTopN follows the latter's entry below.)
PrecisionTopN
PrecisionTopN(self, clustering:skills_ml.ontologies.clustering.Clustering, topn=10)
For a given concept cluster and a given number n, find the top n most similar entities from the whole entity pool based on cosine similarity, then calculate the top-n precision: the number of true positives among the n closest entities divided by n.
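RecallTopN and PrecisionTopN share the same retrieval step; only the denominator differs. A hedged sketch, assuming precomputed vectors for the concept and for the whole entity pool (all data here is hypothetical):

```python
import numpy as np

def top_n_entities(concept_vector, entity_vectors, n):
    # rank the whole entity pool by cosine similarity to the concept
    norms = np.linalg.norm(entity_vectors, axis=1) * np.linalg.norm(concept_vector)
    similarities = entity_vectors @ concept_vector / norms
    return np.argsort(similarities)[::-1][:n]

# hypothetical pool of 100 entities, of which indices 0-9 truly
# belong to the concept cluster
entity_vectors = np.random.rand(100, 3)
concept_vector = np.random.rand(3)
cluster_members = set(range(10))

retrieved = set(top_n_entities(concept_vector, entity_vectors, 20))
true_positives = len(retrieved & cluster_members)

recall_top_n = true_positives / len(cluster_members)  # RecallTopN
precision_top_n = true_positives / 20                 # PrecisionTopN
```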
skills_ml.evaluation.job_title_normalizers
Test job normalizers
Requires 'interesting_job_titles.csv' to be populated, with columns in the following order:

- input job title
- description of job
- ONET code
Each task outputs two CSV files: one with the normalizer's ranks and one without. The unranked file is for sending to people to fill out; the ranked file is for testing those responses against the normalizer's.
Originally written by Kwame Porter Robinson
InputSchema
InputSchema(self, /, *args, **kwargs)
An enumeration listing the data elements and indices taken from source data
InterimSchema
InterimSchema(self, /, *args, **kwargs)
An enumeration listing the data elements and indices after normalization
NormalizerResponse
NormalizerResponse(self, name=None, access=None, num_examples=3)
Abstract interface enforcing common iteration and access patterns across a variety of possible normalizers.
Args
- name (string): A name for the normalizer
- access (filename or file object): A tab-delimited CSV with column order {job_title, description, soc_code}
- num_examples (int, optional): Number of top responses to include
Normalizers should return a list of results, ordered by relevance, each with a 'title' key and an optional 'relevance_score' key
MiniNormalizer
MiniNormalizer(self, name, access, normalize_class)
Accesses normalizer classes which can be instantiated and which implement 'normalize_job_title(job_title)'
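A hypothetical normalizer satisfying both contracts (the class and the name passed below are made up for illustration):

```python
class TitlecaseNormalizer:
    """A toy normalizer: implements normalize_job_title(job_title)."""
    def normalize_job_title(self, job_title):
        # results ordered by relevance, each with a 'title' key
        # and an optional 'relevance_score' key
        return [{'title': job_title.title(), 'relevance_score': 1.0}]

response = MiniNormalizer(
    name='titlecase',                     # hypothetical name
    access='interesting_job_titles.csv',  # the input file described above
    normalize_class=TitlecaseNormalizer,
)
```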
DataAtWorkNormalizer
DataAtWorkNormalizer(self, name=None, access=None, num_examples=3)
skills_ml.evaluation.occ_cls_evaluator
ClassificationEvaluator
ClassificationEvaluator(self, result_generator)
OnetOccupationClassificationEvaluator
OnetOccupationClassificationEvaluator(self, result_generator)
skills_ml.evaluation.representativeness_calculators
Calculate representativeness of a dataset, such as job postings
skills_ml.evaluation.representativeness_calculators.geo_occupation
Computes geographic representativeness of job postings based on ONET SOC Code
GeoOccupationRepresentativenessCalculator
GeoOccupationRepresentativenessCalculator(self, geo_querier=None, normalizer=None)
Calculates geographic representativeness of SOC codes. If a job normalizer is given, it will attempt to compute SOC codes for jobs that are missing them
Args
- geo_querier (skills_ml.job_postings.geography_queriers): An object that can return a CBSA from a job posting
- normalizer (skills_ml.algorithms.occupation_classifiers): An object that can return the SOC code from a job posting
skills_ml.evaluation.skill_extraction_metrics
OntologyCompetencyRecall
OntologyCompetencyRecall(self, ontology:skills_ml.ontologies.base.CompetencyOntology)
The percentage of competencies in an ontology which are present in the candidate skills
OntologyOccupationRecall
OntologyOccupationRecall(self, ontology:skills_ml.ontologies.base.CompetencyOntology)
The percentage of occupations in the ontology that are present in the candidate skills
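Conceptually, both recall metrics reduce to set overlap. A sketch with made-up names (not the classes' internal code):

```python
ontology_competencies = {'python', 'sql', 'carpentry'}  # hypothetical ontology
extracted_skill_names = {'python', 'sql'}               # names seen in candidate skills

# OntologyCompetencyRecall: share of the ontology's competencies recovered
competency_recall = (
    len(ontology_competencies & extracted_skill_names) / len(ontology_competencies)
)
# OntologyOccupationRecall is analogous, with occupations in place of competencies
```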
MedianSkillsPerDocument
MedianSkillsPerDocument(self, /, *args, **kwargs)
The median number of distinct skills present in each document
SkillsPerDocumentHistogram
SkillsPerDocumentHistogram(self, bins=10, *args, **kwargs)
A histogram of the number of distinct skills present in each document
PercentageNoSkillDocuments
PercentageNoSkillDocuments(self, /, *args, **kwargs)
The percentage of documents that contained zero skills
TotalVocabularySize
TotalVocabularySize(self, /, *args, **kwargs)
The total number of skills represented
TotalOccurrences
TotalOccurrences(self, /, *args, **kwargs)
The total number of candidate skill occurrences
EvaluationSetPrecision
EvaluationSetPrecision(self, candidate_skills:Generator[skills_ml.algorithms.skill_extractors.base.CandidateSkill, NoneType, NoneType], evaluation_set_name:str, strict:bool=True)
Find the precision evaluated against an evaluation set of candidate skills.
Args
- candidate_skills (CandidateSkillYielder): A collection of candidate skills to evaluate against
- evaluation_set_name (str): A name for the evaluation set of candidate skills. Used in the name of the metric so results from multiple evaluation sets can be compared side-by-side.
- strict (bool, default True): Whether or not to enforce the exact location of the match, versus just matching between sets on the same skill name and document. Setting this to False will guard against:
    1. labelers who don't mark every instance of a skill once they have found one instance
    2. discrepancies in start_index values caused by errant transformation methods

  However, this could also produce false matches, so use with care.
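The strict flag can be read as a choice of match key. A conceptual sketch (assuming a CandidateSkill exposes document_id, skill_name, and start_index fields; this is not the library's internal code):

```python
def match_key(candidate_skill, strict=True):
    if strict:
        # the exact location of the match must agree
        return (candidate_skill.document_id,
                candidate_skill.skill_name,
                candidate_skill.start_index)
    # non-strict: the same skill name in the same document is enough
    return (candidate_skill.document_id, candidate_skill.skill_name)
```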
EvaluationSetRecall
EvaluationSetRecall(self, candidate_skills, evaluation_set_name, strict=True)
Find the recall evaluated against an evaluation set of candidate skills.
Args
- candidate_skills (CandidateSkillYielder): A collection of candidate skills to evaluate against
- evaluation_set_name (str): A name for the evaluation set of candidate skills. Used in the name of the metric so results from multiple evaluation sets can be compared side-by-side.
- strict (bool, default True): Whether or not to enforce the exact location of the match, versus just matching between sets on the same skill name and document. Setting this to False will guard against:
    1. labelers who don't mark every instance of a skill once they have found one instance
    2. discrepancies in start_index values caused by errant transformation methods

  However, this could also produce false matches, so use with care.
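With the match keys from the sketch above, the two metrics differ only in their denominators (illustrative, not the library's code; extractor_output and evaluation_set are hypothetical collections of candidate skills):

```python
candidate_keys = {match_key(cs) for cs in extractor_output}  # skills being evaluated
evaluation_keys = {match_key(cs) for cs in evaluation_set}   # gold-standard skills
matched = candidate_keys & evaluation_keys

precision = len(matched) / len(candidate_keys)  # EvaluationSetPrecision
recall = len(matched) / len(evaluation_keys)    # EvaluationSetRecall
```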