nestauk
diff --git a/README.md b/README.md
Lines changed: 8 additions & 23 deletions
diff --git a/docs/index.md b/docs/index.md
Lines changed: 9 additions & 26 deletions
diff --git a/docs/pipeline_summary.md b/docs/pipeline_summary.md
Lines changed: 37 additions & 3 deletions
diff --git a/ojd_daps_skills/__init__.py b/ojd_daps_skills/__init__.py
Lines changed: 2 additions & 1 deletion
diff --git a/ojd_daps_skills/configs/extract_skills_esco.yaml b/ojd_daps_skills/configs/extract_skills_esco.yaml
Lines changed: 1 addition & 0 deletions
@@ -26,38 +26,23 @@ To install as a package:
 pip install ojd-daps-skills
 ```
 
-Note: If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+> 🐍 **NOTE:** If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+
+> ⏳ **NOTE:** The first time you import `SkillsExtractor` in python it will take some time (around a minute) to load.
 
 To extract skills from a job advert:
 
 ```
 from ojd_daps_skills.extract_skills.extract_skills import SkillsExtractor
 
-sm = SkillsExtractor(taxonomy_name="toy")
-
-✘ nestauk/en_skillner NER model not loaded. Downloading model...
-Collecting en-skillner==any
-  Downloading https://huggingface.co/nestauk/en_skillner/resolve/main/en_skillner-any-py3-none-any.whl (587.7 MB)
-     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 5.1 MB/s eta 0:00:0000:0100:01
-Installing collected packages: en-skillner
-Successfully installed en-skillner-3.7.1
-✘ Multi-skill classifier not loaded. Downloading model...
-Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 26843.55it/s]
-✘ Neccessary data files are not downloaded. Downloading ~0.5GB of
-neccessary data files to
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data.
-ℹ Data folder downloaded from
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data
+sm = SkillsExtractor(taxonomy_name="toy") # Can also use "esco" or "lightcast" here
 
 job_ads = [
     "The job involves communication skills and maths skills",
     "The job involves Excel skills. You will also need good presentation skills",
     "You will need experience in the IT sector.",
 ]
 job_ad_with_skills = sm(job_ads)
-
-ℹ Getting embeddings for 3 texts ...
-ℹ Took 0.018199920654296875 seconds
 ```
 
 To access the extracted and mapped skills for each inputted job advert:
@@ -78,15 +63,15 @@ Which returns:
 
 ```
 Job advert: The job involves communication skills and maths skills
-Entities found: [('communication skills', 'SKILL'), ('maths', 'SKILL')]
-Skill spans: [communication skills, maths]
-Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths', 'ojo_skill_id': 2887431344496880, 'match_skill': 'working with computers', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
+Entities found: [('communication skills', 'SKILL'), ('maths skills', 'SKILL')]
+Skill spans: [communication skills, maths skills]
+Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths skills', 'ojo_skill_id': 1654958883999821, 'match_skill': 'working with computers', 'match_score': 0.6666666666666666, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
 
 
 Job advert: The job involves Excel skills. You will also need good presentation skills
 Entities found: [('Excel', 'SKILL'), ('presentation skills', 'SKILL')]
 Skill spans: [Excel, presentation skills]
-Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249448453751, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
+Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249334335327, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
 
 
 Job advert: You will need experience in the IT sector.
 
@@ -26,44 +26,27 @@ You can use pip to install the library:
 
 `pip install ojd-daps-skills`
 
-Note: If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
+> 🐍 **NOTE:** If you are using a conda environment you may need to do `conda install scipy` before pip installing this library.
 
-Note that this package was developed on MacOS and tested on Ubuntu. Changes have been made to be compatible on a Windows system but are not tested and cannot be guaranteed.
-
-When the package is first used it will automatically download a folder of neccessary data and models (~1GB).
+> 💻 **NOTE:** This package was developed on MacOS and tested on Ubuntu. Changes have been made to be compatible on a Windows system but are not tested and cannot be guaranteed.
 
 ## TL;DR: Using Nesta’s Skills Extractor library
 
+> ⏳ **NOTE:** The first time you import `SkillsExtractor` in python it will take some time (around a minute) to load.
+
 To extract skills from a job advert:
 
 ```
 from ojd_daps_skills.extract_skills.extract_skills import SkillsExtractor
 
-sm = SkillsExtractor(taxonomy_name="toy")
-
-✘ nestauk/en_skillner NER model not loaded. Downloading model...
-Collecting en-skillner==any
-  Downloading https://huggingface.co/nestauk/en_skillner/resolve/main/en_skillner-any-py3-none-any.whl (587.7 MB)
-     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 5.1 MB/s eta 0:00:0000:0100:01
-Installing collected packages: en-skillner
-Successfully installed en-skillner-3.7.1
-✘ Multi-skill classifier not loaded. Downloading model...
-Fetching 4 files: 100%|██████████| 4/4 [00:00<00:00, 26843.55it/s]
-✘ Neccessary data files are not downloaded. Downloading ~0.5GB of
-neccessary data files to
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data.
-ℹ Data folder downloaded from
-/Users/india.kerlenesta/Projects/nesta/ojd_daps/ojd_daps_extension/ojd_daps_skills/ojd_daps_skills_data
+sm = SkillsExtractor(taxonomy_name="toy") # Can also use "esco" or "lightcast" here
 
 job_ads = [
     "The job involves communication skills and maths skills",
     "The job involves Excel skills. You will also need good presentation skills",
     "You will need experience in the IT sector.",
 ]
 job_ad_with_skills = sm(job_ads)
-
-ℹ Getting embeddings for 3 texts ...
-ℹ Took 0.018199920654296875 seconds
 ```
 
 To access the extracted and mapped skills for each inputted job advert:
@@ -84,15 +67,15 @@ Which returns:
 
 ```
 Job advert: The job involves communication skills and maths skills
-Entities found: [('communication skills', 'SKILL'), ('maths', 'SKILL')]
-Skill spans: [communication skills, maths]
-Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths', 'ojo_skill_id': 2887431344496880, 'match_skill': 'working with computers', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
+Entities found: [('communication skills', 'SKILL'), ('maths skills', 'SKILL')]
+Skill spans: [communication skills, maths skills]
+Skills mapped: [{'ojo_skill': 'communication skills', 'ojo_skill_id': 3144285826919113, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.75, 'match_type': 'most_common_level_1', 'match_id': 'S1'}, {'ojo_skill': 'maths skills', 'ojo_skill_id': 1654958883999821, 'match_skill': 'working with computers', 'match_score': 0.6666666666666666, 'match_type': 'most_common_level_1', 'match_id': 'S5'}]
 
 
 Job advert: The job involves Excel skills. You will also need good presentation skills
 Entities found: [('Excel', 'SKILL'), ('presentation skills', 'SKILL')]
 Skill spans: [Excel, presentation skills]
-Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249448453751, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
+Skills mapped: [{'ojo_skill': 'Excel', 'ojo_skill_id': 2576630861021310, 'match_skill': 'use spreadsheets software', 'match_score': 0.7379249334335327, 'match_type': 'skill', 'match_id': 'abcd'}, {'ojo_skill': 'presentation skills', 'ojo_skill_id': 1846141317334203, 'match_skill': 'communication, collaboration and creativity', 'match_score': 0.5, 'match_type': 'most_common_level_1', 'match_id': 'S1'}]
 
 
 Job advert: You will need experience in the IT sector.
 
@@ -23,9 +23,43 @@ For further information or feedback please contact Liz Gallagher, India Kerle or
 - Out of scope is extracting and matching skills from job adverts in non-English languages; extracting and matching skills from texts other than job adverts; drawing conclusions on new, unidentified skills.
 - Skills extracted should not be used to determine skill demand without expert steer and input nor should be used for any discriminatory hiring practices.
 
-## Metrics
+## Metrics - The model trained on data from 8th August 2023 (correct as of 29th May 2025)
 
-There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares. The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated.
+There is no exact way to evaluate how well our pipeline works; however we have several proxies to better understand how our approach compares.
+
+### Evaluation 2 - Manual judgement of skills extraction and mapping quality
+
+We manually tagged a random sample of skills extracted from job adverts, with whether we thought they were inappropriate, OK or excellent skill entities, and whether we thought they had inappropriate, OK or excellent matches to ESCO skills (or other parts of the taxonomy).
+
+- We felt that out of 202 skill entities 73% were excellent entities, 17% were OK and 10% were inappropriate.
+- 192 of the 202 skill entities were matched to ESCO skills or parts of the taxonomy.
+- Of the 192 matched skills, we felt 45% were excellently matched, 27% were OK and 27% were inappropriate.
+- Of the 96 skills matched to ESCO skills, we felt 71% were excellently matched, 24% were OK and 5% were inappropriate.
+
+| Skill entity quality | ESCO match quality | count |
+| -------------------- | ------------------ | ----- |
+| Inappropriate        | Inappropriate      | 18    |
+| Inappropriate        | OK                 | 3     |
+| OK                   | Inappropriate      | 15    |
+| OK                   | OK                 | 15    |
+| OK                   | Excellent          | 4     |
+| Excellent            | Inappropriate      | 19    |
+| Excellent            | OK                 | 36    |
+| Excellent            | Excellent          | 92    |
+
+- 89% of the matches were to either an individual skill or the lowest level of the skills taxonomy (level 3).
+- The match quality is at its best when the skill entity is matched to an individual ESCO skill.
+
+| Taxonomy level mapped to | Number in sample | Average match quality score (0-inappropriate, 1-OK, 2-excellent) |
+| ------------------------ | ---------------- | ---------------------------------------------------------------- |
+| Skill                    | 96               | 1.66                                                             |
+| Skill hierarchy level 3  | 84               | 0.70                                                             |
+| Skill hierarchy level 2  | 7                | 1                                                                |
+| Skill hierarchy level 1  | 5                | 0.40                                                             |
+
+## Metrics - The model trained on data from 25th August 2022
+
+> ⚠️ **NOTE:** The analysis in this section was performed using the results of the `20220825` model. We believe the newer `20230808` model will improve these results, but the analysis hasn't been repeated apart from 'Evaluation 2' discussed above.
 
 ### Comparison 1 - Top skill groups per occupation comparison to ESCO essential skill groups per occupation
 
@@ -93,5 +127,5 @@ We manually tagged a random sample of skills extracted from job adverts, with wh
 | Skill hierarchy level 3  | 51               | 0.90                                                             |
 | Attitudes hierarchy      | 8                | 1.63                                                             |
 | Skill hierarchy level 2  | 6                | 0.33                                                             |
-| Knoweldge hierarchy      | 6                | 0.17                                                             |
+| Knowledge hierarchy      | 6                | 0.17                                                             |
 | Transversal hierarchy    | 1                | 1.00                                                             |
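
Reviewer note: the entity-quality percentages quoted in the new "Evaluation 2" section of `docs/pipeline_summary.md` (73% excellent, 17% OK, 10% inappropriate out of 202 entities) can be re-derived from the "Skill entity quality / ESCO match quality" count table added in the same hunk. The following is a minimal sketch of that arithmetic only; the variable names are illustrative and are not part of the `ojd_daps_skills` package.

```python
from collections import Counter

# (skill entity quality, ESCO match quality, count) rows copied from the
# table added to docs/pipeline_summary.md in this diff.
rows = [
    ("Inappropriate", "Inappropriate", 18),
    ("Inappropriate", "OK", 3),
    ("OK", "Inappropriate", 15),
    ("OK", "OK", 15),
    ("OK", "Excellent", 4),
    ("Excellent", "Inappropriate", 19),
    ("Excellent", "OK", 36),
    ("Excellent", "Excellent", 92),
]

# Sum the counts per entity-quality label, ignoring the match-quality column.
entity_counts = Counter()
for entity_quality, _match_quality, count in rows:
    entity_counts[entity_quality] += count

total = sum(entity_counts.values())

# Share of each entity-quality label, rounded to whole percentages.
entity_shares = {
    label: round(100 * count / total) for label, count in entity_counts.items()
}

print(total, entity_shares)
```

The sketch covers only the entity-quality bullet; the match-quality bullets are computed over the 192 matched entities and are not reproduced here.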
@@ -3,6 +3,7 @@
 import warnings
 from pathlib import Path
 from typing import Optional
+import importlib.resources
 
 import yaml
 from spacy.tokens import Doc
@@ -27,8 +28,8 @@ def get_yaml_config(file_path: Path) -> Optional[dict]:
 
 bucket_name = "open-jobs-lake"
 
-PUBLIC_DATA_FOLDER_PATH = PROJECT_DIR / "ojd_daps_skills_data"
 PUBLIC_MODEL_FOLDER_PATH = PROJECT_DIR / "ojd_daps_skills_models"
+PACKAGE_PATH = importlib.resources.files("ojd_daps_skills")
 
 
 def setup_spacy_extensions():
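
Reviewer note: the hunk above replaces a path built from `PROJECT_DIR` with `importlib.resources.files("ojd_daps_skills")`, which resolves the installed package's location at import time. The sketch below shows the general pattern under stated assumptions: it uses the standard-library `json` package as a stand-in so that it runs without `ojd_daps_skills` installed, and `package_resource` is a hypothetical helper, not a function in this repository.

```python
import importlib.resources


def package_resource(package: str, *parts: str):
    """Return a Traversable pointing at a file or folder inside an installed package.

    Hypothetical helper for illustration; not part of ojd_daps_skills.
    """
    resource = importlib.resources.files(package)  # available in Python 3.9+
    for part in parts:
        resource = resource / part  # Traversable objects support "/" joining
    return resource


# "json" is a stand-in package; in the diff the package is "ojd_daps_skills".
decoder_source = package_resource("json", "decoder.py")
print(decoder_source.name, decoder_source.is_file())
```

Resolving paths this way keeps data lookups tied to wherever the package is installed rather than to a project checkout directory.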
 
@@ -1,4 +1,5 @@
 taxonomy_name: "esco"
+taxonomy_version: "v_1_1_1"
 num_hier_levels: 4
 skill_type_dict:
   {