Commit e73c2b5

Add data to repo (#240)
* update torch due to sentence-transformers changes
* Update torch to 2.3.0
* Add esco data
* set torch version
* push torch back
* pin transformer to stop torch issue in python 3.10, and dont allow spacy >3.8 due to another version issue with blis
* Read data from package location, and saves taxonomy embeddings when they are calculated for the first time
* Add other datasets to git
* rename esco taxonomy to v_1_1_1 to make it clearer, add this in the config file
* Try to download taxonomy embeddings from huggingface hub, if not then they get calculated on the fly
* Refresh readmes with new outputs
* Add newly calculated metrics to pipeline summary doc page
* correct for unmatched skills
* Update package version to major change
1 parent 1335cbd commit e73c2b5
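
The commit message describes a fallback for the taxonomy embeddings: try a download from the Hugging Face hub, otherwise calculate them on the fly, and save them the first time they are calculated. A minimal sketch of that pattern follows. It is not the library's implementation: `load_taxonomy_embeddings`, the `download` and `compute` callables, the JSON cache format and the file name are all hypothetical, and saving a downloaded result to the cache is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Embeddings = Dict[str, List[float]]


def load_taxonomy_embeddings(
    cache_file: Path,
    download: Callable[[], Embeddings],
    compute: Callable[[], Embeddings],
) -> Embeddings:
    """Return embeddings from a local cache, a download, or a fresh computation."""
    if cache_file.exists():
        # A previous run already saved the embeddings, so reuse them.
        return json.loads(cache_file.read_text())
    try:
        embeddings = download()
    except Exception:
        # Download unavailable (offline, missing file, ...): calculate on the fly.
        embeddings = compute()
    # Save the result so the next call skips both the download and the computation.
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(embeddings))
    return embeddings


def failing_download() -> Embeddings:
    # Stand-in for a hub download that cannot be completed.
    raise ConnectionError("hub not reachable")


compute_calls = []


def toy_compute() -> Embeddings:
    # Stand-in for the (slow) embedding calculation; records each call.
    compute_calls.append(1)
    return {"S1": [0.1, 0.2], "S5": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "esco_v_1_1_1_embeddings.json"  # hypothetical file name
    first = load_taxonomy_embeddings(cache, failing_download, toy_compute)
    second = load_taxonomy_embeddings(cache, failing_download, toy_compute)
```

In this sketch the second call reads the cached file, so the computation runs only once even though the download keeps failing.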

19 files changed: 132,216 additions and 156 deletions.

README.md

Lines changed: 8 additions & 23 deletions
@@ -26,38 +26,23 @@ To install as a package:
 pip install ojd-daps-skills
 ```
 
-Note: If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+> 🐍 **NOTE:** If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+
+> **NOTE:** The first time you import `SkillsExtractor` in python it will take some time (around a minute) to load.
 
 To extract skills from a job advert:
 
 ```
 from ojd_daps_skills.extract_skills.extract_skills import SkillsExtractor
 
-sm = SkillsExtractor(taxonomy_name="toy")
-
-✘ nestauk/en_skillner NER model not loaded. Downloading model...
-Collecting en-skillner==any
-Downloading https://huggingface.co/nestauk/en_skillner/resolve/main/en_skillner-any-py3-none-any.whl (587.7 MB)
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 5.1 MB/s eta 0:00:0000:0100:01
-Installing collected packages: en-skillner
-Successfully installed en-skillner-3.7.1
-✘ Multi-skill classifier not loaded. Downloading model...
-Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 26843.55it/s]
-✘ Neccessary data files are not downloaded. Downloading ~0.5GB of
-neccessary data files to
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data.
-ℹ Data folder downloaded from
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data
+sm = SkillsExtractor(taxonomy_name="toy") # Can also use "esco" or "lightcast" here
 
 job_ads = [
 "The job involves communication skills and maths skills",
 "The job involves Excel skills. You will also need good presentation skills",
 "You will need experience in the IT sector.",
 ]
 job_ad_with_skills = sm(job_ads)
-
-ℹ Getting embeddings for 3 texts ...
-ℹ Took 0.018199920654296875 seconds
 ```
 
 To access the extracted and mapped skills for each inputted job advert:
@@ -78,15 +63,15 @@ Which returns:
 
 ```
 Job advert: The job involves communication skills and maths skills
-Entities found: [('communication skills', 'SKILL'), ('maths', 'SKILL')]
-Skill spans: [communication skills, maths]
-Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths', 'ojo_skill_id': 2887431344496880, 'match_skill': 'working with computers', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
+Entities found: [('communication skills', 'SKILL'), ('maths skills', 'SKILL')]
+Skill spans: [communication skills, maths skills]
+Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths skills', 'ojo_skill_id': 1654958883999821, 'match_skill': 'working with computers', 'match_score': 0.6666666666666666, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
 
 
 Job advert: The job involves Excel skills. You will also need good presentation skills
 Entities found: [('Excel', 'SKILL'), ('presentation skills', 'SKILL')]
 Skill spans: [Excel, presentation skills]
-Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249448453751, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
+Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249334335327, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
 
 
 Job advert: You will need experience in the IT sector.

docs/index.md

Lines changed: 9 additions & 26 deletions
@@ -26,44 +26,27 @@ You can use pip to install the library:
 
 `pip install ojd-daps-skills`
 
-Note: If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+> 🐍 **NOTE:** If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
 
-Note that this package was developed on MacOS and tested on Ubuntu. Changes have been made to be compatible on a Windows system but are not tested and cannot be guaranteed.
-
-When the package is first used it will automatically download a folder of neccessary data and models (~1GB).
+> 💻 **NOTE:** This package was developed on MacOS and tested on Ubuntu. Changes have been made to be compatible on a Windows system but are not tested and cannot be guaranteed.
 
 ## TL;DR: Using Nesta’s Skills Extractor library
 
+> **NOTE:** The first time you import `SkillsExtractor` in python it will take some time (around a minute) to load.
+
 To extract skills from a job advert:
 
 ```
 from ojd_daps_skills.extract_skills.extract_skills import SkillsExtractor
 
-sm = SkillsExtractor(taxonomy_name="toy")
-
-✘ nestauk/en_skillner NER model not loaded. Downloading model...
-Collecting en-skillner==any
-Downloading https://huggingface.co/nestauk/en_skillner/resolve/main/en_skillner-any-py3-none-any.whl (587.7 MB)
-━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 5.1 MB/s eta 0:00:0000:0100:01
-Installing collected packages: en-skillner
-Successfully installed en-skillner-3.7.1
-✘ Multi-skill classifier not loaded. Downloading model...
-Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 26843.55it/s]
-✘ Neccessary data files are not downloaded. Downloading ~0.5GB of
-neccessary data files to
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data.
-ℹ Data folder downloaded from
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data
+sm = SkillsExtractor(taxonomy_name="toy") # Can also use "esco" or "lightcast" here
 
 job_ads = [
 "The job involves communication skills and maths skills",
 "The job involves Excel skills. You will also need good presentation skills",
 "You will need experience in the IT sector.",
 ]
 job_ad_with_skills = sm(job_ads)
-
-ℹ Getting embeddings for 3 texts ...
-ℹ Took 0.018199920654296875 seconds
 ```
 
 To access the extracted and mapped skills for each inputted job advert:
@@ -84,15 +67,15 @@ Which returns:
 
 ```
 Job advert: The job involves communication skills and maths skills
-Entities found: [('communication skills', 'SKILL'), ('maths', 'SKILL')]
-Skill spans: [communication skills, maths]
-Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths', 'ojo_skill_id': 2887431344496880, 'match_skill': 'working with computers', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
+Entities found: [('communication skills', 'SKILL'), ('maths skills', 'SKILL')]
+Skill spans: [communication skills, maths skills]
+Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths skills', 'ojo_skill_id': 1654958883999821, 'match_skill': 'working with computers', 'match_score': 0.6666666666666666, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
 
 
 Job advert: The job involves Excel skills. You will also need good presentation skills
 Entities found: [('Excel', 'SKILL'), ('presentation skills', 'SKILL')]
 Skill spans: [Excel, presentation skills]
-Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249448453751, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
+Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249334335327, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
 
 
 Job advert: You will need experience in the IT sector.

docs/pipeline_summary.md

Lines changed: 37 additions & 3 deletions
@@ -23,9 +23,43 @@ For further information or feedback please contact Liz Gallagher, India Kerle or
 - Out of scope is extracting and matching skills from job adverts in non-English languages; extracting and matching skills from texts other than job adverts; drawing conclusions on new, unidentified skills.
 - Skills extracted should not be used to determine skill demand without expert steer and input nor should be used for any discriminatory hiring practices.
 
-## Metrics
+## Metrics - The model trained on data from 8th August 2023 (correct as of 29th May 2025)
 
-There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares. The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated.
+There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares.
+
+### Evaluation 2 - Manual judgement of skills extraction and mapping quality
+
+We manually tagged a random sample of skills extracted from job adverts, with whether we thought they were inappropriate, OK or excellent skill entities, and whether we thought they had inappropriate, OK or excellent matches to ESCO skills (or other parts of the taxonomy).
+
+- We felt that out of 202 skill entities 73% were excellent entities, 17% were OK and 10% were inappropriate.
+- 192 of the 202 skill entities were matched to ESCO skills or parts of the taxonomy.
+- Of the 192 matched skills, we felt 45% were excellently matched, 27% were OK and 27% were inappropriate.
+- Of the 96 skills matched to ESCO skills, we felt 71% were excellently matched, 24% were OK and 5% were inappropriate.
+
+| Skill entity quality | ESCO match quality | count |
+| -------------------- | ------------------ | ----- |
+| Inappropriate | Inappropriate | 18 |
+| Inappropriate | OK | 3 |
+| OK | Inappropriate | 15 |
+| OK | OK | 15 |
+| OK | Excellent | 4 |
+| Excellent | Inappropriate | 19 |
+| Excellent | OK | 36 |
+| Excellent | Excellent | 92 |
+
+- 89% of the matches were to either an individual skill or the lowest level of the skills taxonomy (level 3).
+- The match quality is at its best when the skill entity is matched to an individual ESCO skill.
+
+| Taxonomy level mapped to | Number in sample | Average match quality score (0-inappropriate, 1-OK, 2-excellent) |
+| ------------------------ | ---------------- | ---------------------------------------------------------------- |
+| Skill | 96 | 1.66 |
+| Skill hierarchy level 3 | 84 | 0.70 |
+| Skill hierarchy level 2 | 7 | 1 |
+| Skill hierarchy level 1 | 5 | 0.40 |
+
+## Metrics - The model trained on data from 25th August 2022
+
+> ⚠️ **NOTE:** The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated apart from 'Evaluation 2' discussed above.
 
 ### Comparison 1 - Top skill groups per occupation comparison to ESCO essential skill groups per occupation
 
@@ -93,5 +127,5 @@ We manually tagged a random sample of skills extracted from job adverts, with wh
 | Skill hierarchy level 3 | 51 | 0.90 |
 | Attitudes hierarchy | 8 | 1.63 |
 | Skill hierarchy level 2 | 6 | 0.33 |
-| Knoweldge hierarchy | 6 | 0.17 |
+| Knowledge hierarchy | 6 | 0.17 |
 | Transversal hierarchy | 1 | 1.00 |
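
The skill entity quality percentages quoted in the new "Evaluation 2" bullets (73% excellent, 17% OK, 10% inappropriate out of 202) can be reproduced from the count table added in the same file. A minimal sketch, with the rows transcribed by hand from that table; the variable names are illustrative and not part of the repository:

```python
from collections import Counter

# (skill entity quality, ESCO match quality, count) rows from the added table.
rows = [
    ("Inappropriate", "Inappropriate", 18),
    ("Inappropriate", "OK", 3),
    ("OK", "Inappropriate", 15),
    ("OK", "OK", 15),
    ("OK", "Excellent", 4),
    ("Excellent", "Inappropriate", 19),
    ("Excellent", "OK", 36),
    ("Excellent", "Excellent", 92),
]

# Sum the counts for each skill entity quality label.
entity_counts = Counter()
for entity_quality, _match_quality, count in rows:
    entity_counts[entity_quality] += count

total = sum(entity_counts.values())

# Share of each label, rounded to whole percentages as in the bullet list.
entity_share = {label: round(100 * n / total) for label, n in entity_counts.items()}
```

This only checks the entity quality figures; the match quality percentages are stated relative to the 192 matched skills and are not recomputed here.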

ojd_daps_skills/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -3,6 +3,7 @@
 import warnings
 from pathlib import Path
 from typing import Optional
+import importlib.resources
 
 import yaml
 from spacy.tokens import Doc
@@ -27,8 +28,8 @@ def get_yaml_config(file_path: Path) -> Optional[dict]:
 
 bucket_name = "open-jobs-lake"
 
-PUBLIC_DATA_FOLDER_PATH = PROJECT_DIR / "ojd_daps_skills_data"
 PUBLIC_MODEL_FOLDER_PATH = PROJECT_DIR / "ojd_daps_skills_models"
+PACKAGE_PATH = importlib.resources.files("ojd_daps_skills")
 
 
 def setup_spacy_extensions():
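
The new `PACKAGE_PATH` line uses `importlib.resources.files`, which returns a traversable object rooted at an installed package (Python 3.9+). A minimal sketch of reading a file through it; `package_file` is a hypothetical helper, and the standard-library `json` package stands in for `ojd_daps_skills` so the snippet runs without the library installed:

```python
import importlib.resources


def package_file(package: str, *parts: str):
    """Return a Traversable pointing at a file inside an installed package."""
    resource = importlib.resources.files(package)
    for part in parts:
        # joinpath descends one level at a time, like pathlib's "/" operator.
        resource = resource.joinpath(part)
    return resource


# "json" is only a stand-in package; any importable package name works.
init_file = package_file("json", "__init__.py")

# Traversable objects expose read_text / read_bytes, so data bundled with a
# package can be read without hard-coding a project directory.
source_head = init_file.read_text(encoding="utf-8")[:200]
```

Resolving paths relative to the installed package, rather than a project directory, is what lets data files shipped inside the package be found after a `pip install`.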

ojd_daps_skills/configs/extract_skills_esco.yaml

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 taxonomy_name: "esco"
+taxonomy_version: "v_1_1_1"
 num_hier_levels: 4
 skill_type_dict:
 {
