[TOC]
skills_ml.job_postings.aggregate
skills_ml.job_postings.aggregate.dataset_transform
Track stats of job listing datasets, before and after transformation into the common schema.
DatasetStatsCounter
DatasetStatsCounter(self, dataset_id, quarter)
Accumulate Dataset ETL statistics for a quarter to show presence and absence of different fields, and the total count of rows
Args
dataset_id (string) A dataset id
quarter (string) The quarter being analyzed
DatasetStatsAggregator
DatasetStatsAggregator(self, dataset_id, s3_conn)
Aggregate Dataset ETL statistics up to the dataset level
Args
dataset_id (string) A dataset id
s3_conn (boto.Connection) an s3 connection
GlobalStatsAggregator
GlobalStatsAggregator(self, s3_conn)
Aggregate Dataset ETL statistics up to the global level
Args
s3_conn (boto.Connection) an s3 connection
skills_ml.job_postings.aggregate.field_values
Track field value distribution of common schema job postings
FieldValueCounter
FieldValueCounter(self, quarter, field_values)
Accumulate field distribution statistics for common schema job postings
Args
quarter (string) The quarter being analyzed
field_values (list) each entry should be either
1. a field key
2. a tuple: the first value a field key, the second value a function to fetch the value or values from the document (see the sketch below)
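Both entry forms can be mixed in one counter. A minimal sketch (the field names and the fetch function here are illustrative, not part of the library):
```python
from skills_ml.job_postings.aggregate.field_values import FieldValueCounter

# Illustrative only: a plain field key, plus a (key, function) pair that pulls
# a derived value out of each job posting document
counter = FieldValueCounter(
    quarter='2014Q1',
    field_values=[
        'experienceRequirements',
        ('title', lambda posting: posting.get('title', '').lower()),
    ]
)
```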
skills_ml.job_postings.aggregate.pandas
Aggregation functions that can be used with pandas dataframes
listy_n_most_common
listy_n_most_common(*params, **kwparams)
Expects each item to be iterable, each sub-item to be addable
AggregateFunction
AggregateFunction(self, returns)
Wrap a function with an attribute that indicates the return type name
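A rough sketch of applying one of these functions through a pandas groupby/agg call, assuming listy_n_most_common can be used directly as a one-argument aggregate (if it requires extra parameters, they would need to be bound with functools.partial first); the DataFrame and column names are illustrative:
```python
import pandas as pd
from skills_ml.job_postings.aggregate.pandas import listy_n_most_common

# Illustrative data: each 'skills' cell is an iterable whose sub-items can be tallied
df = pd.DataFrame({
    'soc_major_group': ['11', '11', '13'],
    'skills': [['python', 'sql'], ['python'], ['excel']],
})
# Assumption: the function works as a one-argument aggregate over each group's values
top_skills = df.groupby('soc_major_group')['skills'].agg(listy_n_most_common)
```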
skills_ml.job_postings.common_schema
A variety of common-schema job posting collections.
Each class in this module should implement a generator that yields job postings (in the common schema, as a JSON string), and has a 'metadata' attribute so any users of the job postings can inspect meaningful metadata about the postings.
JobPostingCollectionFromS3
JobPostingCollectionFromS3(self, s3_conn, s3_paths, extra_metadata=None)
Stream job postings from S3.
Expects that each will be stored in JSON format, one job posting per line. The s3_path given will be iterated through as a prefix, so job postings may be partitioned under that prefix however you choose. It will look in every file under that prefix.
Example
```python
import json
from airflow.hooks import S3Hook
from skills_ml.job_postings.common_schema import JobPostingCollectionFromS3

s3_conn = S3Hook().get_conn()
job_postings_generator = JobPostingCollectionFromS3(s3_conn, s3_paths=['my-bucket/job_postings_common_schema'])
for job_posting in job_postings_generator:
    print(job_posting['title'])
```
Attributes
s3_conn
: a boto s3 connection
s3_path
: path to the job listings. there may be multiple
JobPostingCollectionSample
JobPostingCollectionSample(self, num_records:int=50)
Stream a finite number of job postings stored within the library.
Example
```python
import json

job_postings = JobPostingCollectionSample()
for job_posting in job_postings:
    print(json.loads(job_posting)['title'])
```
Meant to provide a dependency-less example of common schema job postings for introduction to the library
Args:
num_records (int): The maximum number of records to return. Defaults to 50 (all postings available)
generate_job_postings_from_s3
generate_job_postings_from_s3(s3_conn, s3_prefix:str) -> Generator[Dict[str, Any], NoneType, NoneType]
Stream all job listings from S3
Args
- s3_conn: a boto s3 connection
- s3_prefix: path to the job listings.
Yields
string in json format representing the next job listing
Refer to sample_job_listing.json for example structure
generate_job_postings_from_s3_multiple_prefixes
generate_job_postings_from_s3_multiple_prefixes(s3_conn, s3_prefixes:str) -> Generator[Dict[str, Any], NoneType, NoneType]
Chain together the job posting generators for a list of S3 prefixes
Args
- s3_conn: a boto s3 connection
- s3_prefixes: paths to job listings
Return
a single generator that chains together the generators for all prefixes
batches_generator
batches_generator(iterable, batch_size)
Generate batches of a given size from an iterable
Args
- iterable: an iterable
- batch_size: batch size
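A small sketch of the batching behavior on a plain range (whether each batch is a list or a generator is not specified above, so list() is used to materialize it):
```python
from skills_ml.job_postings.common_schema import batches_generator

# Split ten items into batches of four; the final batch holds the remainder
for batch in batches_generator(range(10), batch_size=4):
    print(list(batch))
```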
get_onet_occupation
get_onet_occupation(job_posting)
Retrieve the occupation from the job posting
First checks the custom 'onet_soc_code' key, then the standard 'occupationalCategory' key, and falls back to the unknown occupation
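A sketch of the lookup order on a hand-built posting (the dict is a minimal stand-in for a full common schema document):
```python
from skills_ml.job_postings.common_schema import get_onet_occupation

# No custom 'onet_soc_code' key here, so the standard 'occupationalCategory' key is used
posting = {'title': 'Software Developer', 'occupationalCategory': '15-1132.00'}
occupation = get_onet_occupation(posting)
```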
skills_ml.job_postings.computed_properties
Encapsulates the computation of some piece of data for job postings, to make aggregation and tabular datasets easy to produce
JobPostingComputedProperty
JobPostingComputedProperty(self, storage, partition_func=None)
Base class for computers of job posting properties.
Using this class, expensive computations can be performed once, stored on S3 per job posting in partitions, and reused in different aggregations.
The base class takes care of all of the serialization and partitioning, leaving subclasses to implement a function for computing the property of a single posting and metadata describing the output of this function.
Subclasses must implement
- _compute_func_on_one to produce a callable that takes in a single
job posting and returns JSON-serializable output representing the computation target.
This function can produce objects that are kept in scope and reused,
so properties that require a large object (e.g. a trained classifier) to do their
computation work can be downloaded from S3 here without requiring the I/O work
to be done over and over. (See .computers.SOCClassifyProperty for illustration)
- property_name attribute (string) that is used when saving the computed properties
- property_columns attribute (list) of ComputedPropertyColumns that
map to the column names output by `_compute_func_on_one`
Args
storage (skills_ml.storage.Store) A storage object in which to store the cached properties.
partition_func (callable, optional) A function that takes a job posting and
outputs a string that should be used as a partition key. Must be deterministic.
Defaults to the 'datePosted' value
The caches will be namespaced by the property name and partition function
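A minimal sketch of a subclass satisfying these requirements; the property itself (job title length) is made up for illustration and is not part of the library:
```python
from skills_ml.job_postings.computed_properties import (
    JobPostingComputedProperty,
    ComputedPropertyColumn,
)

class TitleLength(JobPostingComputedProperty):
    """Illustrative computed property: the number of characters in the job title"""
    property_name = 'title_length'
    property_columns = [
        ComputedPropertyColumn(
            name='title_length',
            description='Number of characters in the job title',
        )
    ]

    def _compute_func_on_one(self):
        # Any expensive setup (e.g. downloading a trained model) would happen here,
        # outside the returned callable, so it is only done once
        def title_length(job_posting):
            return len(job_posting.get('title', ''))
        return title_length
```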
ComputedPropertyColumn
ComputedPropertyColumn(self, name, description, compatible_aggregate_function_paths=None)
Metadata about a specific output column of a computed property
Args
name (string) The name of the column
description (string) A description of the column and how it was populated.
compatible_aggregate_function_paths (dict, optional) If this property is meant to be used in aggregations, map string function paths to descriptions of what the function is computing for this column. All function paths should be compatible with pandas.agg (one argument, an iterable), though multi-argument functions can be used in conjunction with functools.partial
skills_ml.job_postings.computed_properties.aggregators
Aggregate job posting computed properties into tabular datasets
df_for_properties_and_keys
df_for_properties_and_keys(computed_properties, keys)
Assemble a dataframe with the raw data from many computed properties and keys
Args
computed_properties (list of JobPostingComputedProperty)
keys (list of strs)
- Returns: pandas.DataFrame
expand_array_col_to_many_cols
expand_array_col_to_many_cols(base_col, func, aggregation)
Expand an array column created as the result of an .aggregate call into many columns
Args
base_col (string) The name of the base column (before .aggregate)
func (function) The base function that was aggregated on
aggregation (pandas.DataFrame) The post-aggregation dataframe
- Returns: pandas.DataFrame, minus the array column and plus columns for each array value
base_func
base_func(aggregate_function)
Deals with the possibility of functools.partial being applied to a given function. Allows access to the decorated 'return' attribute whether or not it is also a partial function
Args
aggregate_function (callable) Either a raw function or a functools.partial object
- Returns: callable
aggregation_for_properties_and_keys
aggregation_for_properties_and_keys(grouping_properties, aggregate_properties, aggregate_functions, keys)
Assemble an aggregation dataframe for given partition keys
Args
grouping_properties (list of JobPostingComputedProperty)
Properties to form the primary key of the aggregation
aggregate_properties (list of JobPostingComputedProperty)
Properties to be aggregated over the primary key
aggregate_functions (dict) A lookup of aggregate functions
to be applied for each aggregate column
keys (list of str) The desired partition keys for the aggregation to cover
- Returns: pandas.DataFrame indexed on the grouping properties, covering all data from the given keys
aggregate_properties
aggregate_properties(out_filename, grouping_properties, aggregate_properties, aggregate_functions, storage, aggregation_name)
Aggregate computed properties and store the resulting CSV
Args
out_filename (string) The desired filename (without path) for the .csv
grouping_properties (list of JobPostingComputedProperty)
Properties to form the primary key of the aggregation
aggregate_properties (list of JobPostingComputedProperty)
Properties to be aggregated over the primary key
aggregate_functions (dict) A lookup of aggregate functions
to be applied for each aggregate column
aggregations_path (string) The base s3 path to store aggregations
aggregation_name (string) The name of this particular aggregation
- Returns: nothing
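A hedged sketch of wiring these pieces together, reusing the illustrative TitleLength property sketched earlier, GivenSOC as the grouping property, and an assumed filesystem-backed store. It presumes the property caches for the relevant keys have already been computed, and the exact shape of aggregate_functions (column name mapped to a list of pandas-compatible functions) is an assumption:
```python
import numpy
from skills_ml.job_postings.computed_properties.aggregators import aggregate_properties
from skills_ml.job_postings.computed_properties.computers import GivenSOC
from skills_ml.storage import FSStore  # assumed: a filesystem-backed skills_ml Store

store = FSStore('/tmp/computed_properties')
soc = GivenSOC(storage=store)
title_length = TitleLength(storage=store)  # the illustrative subclass sketched earlier

aggregate_properties(
    out_filename='title_length_by_soc.csv',
    grouping_properties=[soc],
    aggregate_properties=[title_length],
    aggregate_functions={'title_length': [numpy.mean]},  # assumed shape: column -> functions
    storage=FSStore('/tmp/aggregations'),
    aggregation_name='title_length_by_soc',
)
```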
skills_ml.job_postings.computed_properties.computers
Various computers of job posting properties. Each class is generally a generic algorithm (such as skill extraction or occupation classification) paired with enough configuration to run on its own
TitleCleanPhaseOne
TitleCleanPhaseOne(self, storage, partition_func=None)
Perform one phase of job title cleaning: lowercase/remove punctuation
TitleCleanPhaseTwo
TitleCleanPhaseTwo(self, storage, partition_func=None)
Perform two phases of job title cleaning
- lowercase/remove punctuation
- Remove geography information
Geography
Geography(self, geo_querier, *args, **kwargs)
Produce a geography by querying a given JobGeographyQuerier
Args
geo_querier
SOCClassifyProperty
SOCClassifyProperty(self, classifier_obj, *args, **kwargs)
Classify the SOC code from a trained classifier
Args
classifier_obj (object, optional) An object to use as a classifier.
If not sent, one will be downloaded from S3
GivenSOC
GivenSOC(self, storage, partition_func=None)
Assign the SOC code given by the partner
HourlyPay
HourlyPay(self, storage, partition_func=None)
The pay given in the baseSalary field if salaryFrequency is hourly
YearlyPay
YearlyPay(self, storage, partition_func=None)
The pay given in the baseSalary field if salaryFrequency is yearly
SkillCounts
SkillCounts(self, skill_extractor, *args, **kwargs)
Add top skill counts from a skill extractor
Args
skill_extractor (skills_ml.algorithms.skill_extractors.base.SkillExtractorBase) A skill extractor object
PostingIdPresent
PostingIdPresent(self, storage, partition_func=None)
Records job posting ids. Used for counting job postings
skills_ml.job_postings.corpora
CorpusCreator
CorpusCreator(self, job_posting_generator=None, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'], raw=False)
A base class for objects that convert common schema job listings into a document-level corpus suitable for use by machine learning algorithms or specific tasks.
Example
```python
from skills_ml.job_postings.common_schema import JobPostingCollectionSample
from skills_ml.job_postings.corpora.basic import CorpusCreator

job_postings_generator = JobPostingCollectionSample()

# Default will include all the cleaned job postings
corpus = CorpusCreator(job_postings_generator)

# For getting the raw job postings without any cleaning
corpus = CorpusCreator(job_postings_generator, raw=True)
```
Attributes
job_posting_generator (generator)
: an iterable that generates JSON strings. Each string is expected to represent a job listing conforming to the common schema. See sample_job_listing.json for an example of this schema
document_schema_fields (list)
: a list of schema fields to be included
raw (bool)
: a flag whether to return the raw documents or transformed documents
Yield
(dict): a dictionary with only the selected fields as keys and the corresponding raw/cleaned values
SimpleCorpusCreator
SimpleCorpusCreator(self, job_posting_generator=None, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'], raw=False)
An object that transforms job listing documents by picking important schema fields and returns them as one large lowercased string
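A brief sketch, assuming the corpus (like CorpusCreator above) is iterable and yields one document per posting:
```python
from skills_ml.job_postings.common_schema import JobPostingCollectionSample
from skills_ml.job_postings.corpora import SimpleCorpusCreator

corpus = SimpleCorpusCreator(JobPostingCollectionSample())
for document in corpus:
    # One large lowercased string per posting, built from the selected schema fields
    print(document)
```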
Doc2VecGensimCorpusCreator
Doc2VecGensimCorpusCreator(self, job_posting_generator, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'], *args, **kwargs)
Corpus for training Gensim Doc2Vec
An object that transforms job listing documents by picking important schema fields and yields them as one large cleaned array of words
Example
```python
from skills_ml.job_postings.common_schema import JobPostingCollectionSample
from skills_ml.job_postings.corpora.basic import Doc2VecGensimCorpusCreator

job_postings_generator = JobPostingCollectionSample()
corpus = Doc2VecGensimCorpusCreator(job_postings_generator)
```
Attributes:
job_posting_generator (generator): a job posting generator
document_schema_fields (list): an list of schema fields to be included
Word2VecGensimCorpusCreator
Word2VecGensimCorpusCreator(self, job_posting_generator, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'], *args, **kwargs)
An object that transforms job listing documents by picking important schema fields and yields them as one large cleaned array of words
JobCategoryCorpusCreator
JobCategoryCorpusCreator(self, job_posting_generator=None, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'], raw=False)
An object that extracts the label of each job listing document, which could be an ONET SOC code or occupationalCategory, and yields it as a lowercased string
SectionExtractWord2VecCorpusCreator
SectionExtractWord2VecCorpusCreator(self, section_regex, *args, **kwargs)
Only return the contents of the configured section headers.
Heavily utilizes skills_ml.algorithms.nlp.section_extract. For more detail on how to define 'sections', refer to its docstring.
RawCorpusCreator
RawCorpusCreator(self, job_posting_generator, document_schema_fields=['description', 'experienceRequirements', 'qualifications', 'skills'])
An object that yields the joined raw string of each job posting
skills_ml.job_postings.filtering
Filtering streamed job postings
soc_major_group_filter
soc_major_group_filter(major_groups:List) -> Callable
Return a function that checks the ONET Soc Code of a job posting (if it is present) against the configured major groups.
JobPostingFilterer
JobPostingFilterer(self, job_posting_generator:Generator[Dict[str, Any], NoneType, NoneType], filter_funcs:List[Callable])
Filter common schema job postings through a number of filtering functions
Args
- job_posting_generator: An iterable of job postings (each in dict form)
- filter_funcs: A list of filtering functions, each taking in a job posting document (as dict) and returning a boolean instructing whether or not the posting passes the filter
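A hedged usage sketch with hand-built postings (tiny stand-ins for full common schema documents), assuming the filterer is itself iterable:
```python
from skills_ml.job_postings.filtering import JobPostingFilterer, soc_major_group_filter

postings = [
    {'title': 'Accountant', 'onet_soc_code': '13-2011.00'},
    {'title': 'Registered Nurse', 'onet_soc_code': '29-1141.00'},
]
filtered = JobPostingFilterer(
    job_posting_generator=iter(postings),
    filter_funcs=[soc_major_group_filter(['13'])],  # keep only SOC major group 13
)
for job_posting in filtered:
    print(job_posting['title'])  # only the accountant should pass the filter
```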
skills_ml.job_postings.geography_queriers
Extracting geographies from job posting datasets
job_posting_search_strings
job_posting_search_strings(job_posting)
Convert a job posting to a geocode-ready search string
Includes city and state if present, or just city
Args
job_posting (dict) A job posting in schema.org/JobPosting json form
- Returns: (string) A geocode-ready search string
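A sketch on a minimal posting; the jobLocation/address keys follow schema.org conventions, and the exact return shape (one string or several, given the plural name) is left to the implementation:
```python
from skills_ml.job_postings.geography_queriers import job_posting_search_strings

# Minimal stand-in for a schema.org/JobPosting document with city and state
posting = {
    'jobLocation': {
        'address': {
            'addressLocality': 'Chicago',
            'addressRegion': 'IL',
        }
    }
}
search_strings = job_posting_search_strings(posting)
```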
skills_ml.job_postings.geography_queriers.base
JobGeographyQuerier
JobGeographyQuerier(self, /, *args, **kwargs)
Base class for retrievers/computers of geography data from job postings
The main interface is query(job_posting), which returns a tuple the same length as self.output_columns.
Subclasses must implement
output_columns (property/attribute): a collection of two-tuples with a name and description
for each column output by the querier
name (property/attribute) a name of the querier
_query(job_posting) to take a job posting object and return
a tuple of the same length as self.output_columns.
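A minimal sketch of a subclass satisfying that interface; the column and address keys are illustrative:
```python
from skills_ml.job_postings.geography_queriers.base import JobGeographyQuerier

class StateFromAddressQuerier(JobGeographyQuerier):
    """Illustrative querier: pull the state directly out of the posting's address"""
    name = 'state_from_address'
    output_columns = (
        ('state', 'State given in the posting jobLocation address'),
    )

    def _query(self, job_posting):
        address = job_posting.get('jobLocation', {}).get('address', {})
        # Must return a tuple the same length as output_columns
        return (address.get('addressRegion'),)
```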
skills_ml.job_postings.geography_queriers.cbsa
Look up the CBSA for a job posting from a census crosswalk (job location -> Census Place -> Census UA -> Census CBSA)
JobCBSAFromGeocodeQuerier
JobCBSAFromGeocodeQuerier(self, geocoder, cbsa_finder)
Queries the Core-Based Statistical Area for a job
This object delegates the CBSA-finding algorithm to a passed-in finder.
In practice, you can look at the skills_ml.algorithms.geocoders.cbsa
module for an example of how this can be generated.
Instead, this object focuses on the job posting-centric logic necessary, such as converting the job posting to the form needed to use the cache and dealing with different kinds of cache misses.
Args
cbsa_finder (dict) A mapping of geocoding search strings to
(CBSA FIPS, CBSA Name) tuples
JobCBSAFromCrosswalkQuerier
JobCBSAFromCrosswalkQuerier(self)
Queries the Core-Based Statistical Area for a job using a census crosswalk
First looks up a Place or County Subdivision by the job posting's state and city. If it finds a result, it will then take the Urbanized Area for that Place or County Subdivision and find CBSAs associated with it.
Queries return all hits, so there may be multiple CBSAs for a given query.
skills_ml.job_postings.geography_queriers.state
skills_ml.job_postings.raw
skills_ml.job_postings.raw.usajobs
Import USAJobs postings into the Open Skills common schema
USAJobsTransformer
USAJobsTransformer(self, bucket_name=None, prefix=None, **kwargs)
skills_ml.job_postings.raw.virginia
VirginiaTransformer
VirginiaTransformer(self, bucket_name=None, prefix=None, **kwargs)
skills_ml.job_postings.sample
Sample job postings
JobSampler
JobSampler(self, job_posting_generator, k, weights=None, key=<function JobSampler.<lambda>>, random_state=None)
Job posting sampler using reservoir sampling methods
It takes a job posting generator as an input. To sample based on weights, one should specify a weights dictionary.
Attributes
job_posting_generator (iterator)
: Job posting iterator to sample from.
k (int)
: number of documents to sample
weights (dict)
: a dictionary that has key-value pairs as label-weighting pairs. It expects every label in the iterator to be present as a key in the weights dictionary. For example, weights = {'11': 2, '13': 1}. In this case, the label/key is the occupation major group and the value is the weight you want to sample with.
key (callable)
: a function to be called on each element to associate to the key of the weights dictionary
random_state (int)
: the seed used by the random number generator
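A hedged usage sketch with hand-built postings, assuming the sampler is iterable and yields the sampled documents:
```python
from skills_ml.job_postings.sample import JobSampler

postings = [
    {'title': 'Chief Executive', 'onet_soc_code': '11-1011.00'},
    {'title': 'Accountant', 'onet_soc_code': '13-2011.00'},
    {'title': 'Financial Analyst', 'onet_soc_code': '13-2051.00'},
]
sampler = JobSampler(
    job_posting_generator=iter(postings),
    k=2,
    weights={'11': 2, '13': 1},  # sample major group 11 twice as heavily as 13
    key=lambda posting: posting['onet_soc_code'][:2],  # map each posting to its weight key
    random_state=42,
)
sample = list(sampler)  # assumption: iterating yields the sampled postings
```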