Contributor

@ngachchi ngachchi commented Jun 25, 2025

What does this PR do?

This PR introduces Hindi Text Normalization 2.0, which features substantial accuracy improvements across multiple classes and the addition of a new Telephone class. It also integrates culturally relevant linguistic constructs to enhance natural language understanding.


Accuracy Improvements by Class:

Text Class    Accuracy
Cardinal      89.86%
Money         99.04%
Time          92.61%
Date          87.18%
Ordinal       38.46%
Decimal       100.00%
Fraction      99.66%
Measure       96.26%
Telephone     99.13%
Address       18.99%

Key Enhancements:

  • New Class: Telephone

    • Added support for STD codes and Indian country code (+91) in landline number normalization.
    • Extended the telephone class with keyword-based classification of pincodes and the last four digits of credit card numbers.
  • Linguistic Enrichment:

    • Incorporated quarter-based terms frequently used in Hindi for the Fraction, Measure, and Time classes:
      • Sava (सवा, plus a quarter), Saadhe (साढ़े, plus a half), Paune (पौने, minus a quarter), Dedh (डेढ़, one and a half), Dhai (ढाई, two and a half)
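As a rough illustration of how these terms pattern, the sketch below maps a number ending in .25, .5, or .75 to the corresponding Hindi expression. It is plain Python, not the pynini grammar from this PR. The function name `quarter_phrase`, the constants, and the tiny `CARDINALS` table are hypothetical and exist only for this example, and edge cases such as values below one are deliberately not handled.

```python
from fractions import Fraction

# Quarter-based terms listed in the PR description.
SAVA, SAADHE, PAUNE, DEDH, DHAI = "सवा", "साढ़े", "पौने", "डेढ़", "ढाई"

# Hypothetical, deliberately tiny cardinal table for the example only.
CARDINALS = {1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच"}


def quarter_phrase(value: str) -> str:
    """Return a quarter-based Hindi reading for strings such as '2.25' or '3.5'."""
    number = Fraction(value)
    # 1.5 and 2.5 have dedicated words instead of a prefix plus cardinal.
    if number == Fraction(3, 2):
        return DEDH
    if number == Fraction(5, 2):
        return DHAI
    whole, frac = divmod(number, 1)
    if frac == Fraction(1, 4):
        prefix, base = SAVA, whole          # n.25 -> quarter more than n
    elif frac == Fraction(1, 2):
        prefix, base = SAADHE, whole        # n.5  -> half more than n
    elif frac == Fraction(3, 4):
        prefix, base = PAUNE, whole + 1     # n.75 -> quarter less than n + 1
    else:
        raise ValueError(f"{value} has no quarter-based reading")
    if base not in CARDINALS:
        raise ValueError(f"{value} is outside this sketch's cardinal table")
    return f"{prefix} {CARDINALS[base]}"
```

Note that the paune case is the only one that looks up the next whole number rather than the current one.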

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. Run pytest or, if your machine does not have a GPU, pytest --cpu from the root folder (provided you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

mgrafu and others added 4 commits April 24, 2025 11:20
* Future Implementations for classes - Measure, Money, and Date (NVIDIA#258)

* Future Implementations for classes - Measure, Money, and Date

Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>

* Resolved the conflicts with mm_yyyy and date ranges and added the previously removed failing test cases.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* removed the unused empty string implementation

* minor fixes for the tagger files

* reformatted decimal final graph

* incorporated the suggestion for decimal graph

* Century implementations

* Working on the yyyy format for the date class

* reverted yyyy code

* working on future implementations

* working on improving the date class accuracy

* added year prefix for the date class

* working on the comma cases for date class

* minor fixes

* implemented mixed fractions

* rectified the test case

* working on quarterly measurements

* reformatted the prefixes and suffixes for date tagger class

* replaced text tag with era tag for the date class

* Removed the text tag reference from date class verbalizer

---------

Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update jenkins cache

Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>

* Potential fix for code scanning alert no. 821: Unused local variable

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>

---------

Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
Signed-off-by: Mariana Graterol Fuenmayor <marianag@nvidia.com>
Signed-off-by: Mariana <47233618+mgrafu@users.noreply.github.com>
Co-authored-by: Namrata Gachchi <ngachchi@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
ngachchi and others added 2 commits June 26, 2025 10:03
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
@ngachchi ngachchi marked this pull request as ready for review July 1, 2025 11:15
Contributor Author

@ngachchi ngachchi left a comment

@mgrafu, could you please review this PR?

ngachchi and others added 2 commits July 7, 2025 16:34
…e telephone class

Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
@mgrafu mgrafu changed the base branch from main to staging_hi_tn July 7, 2025 17:44
ngachchi and others added 2 commits July 8, 2025 09:27
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
ngachchi and others added 2 commits July 9, 2025 09:57
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
@@ -0,0 +1,8 @@
२ दो
Collaborator

is this mapping any different than cardinals (lines 1-4)

Contributor Author

These refer to the validation of landline numbers starting with specific digits within India.

Collaborator

@mgrafu mgrafu Jul 23, 2025

but is the mapping any different from cardinals? if not, please import from cardinals and restrict the accepted inputs

Contributor Author

I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.
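The reviewer's suggestion could look roughly like the following: reuse one digit table and restrict which digits may start a number, instead of keeping a second TSV with the same mappings. This is a plain-Python sketch, not the pynini code from this PR. `CARDINAL_DIGITS`, `ALLOWED_START_DIGITS`, and `restrict` are hypothetical names, and the set of allowed leading digits is chosen only for illustration.

```python
# Stand-in for the existing cardinal single-digit data (Devanagari digit -> word).
CARDINAL_DIGITS = {
    "०": "शून्य", "१": "एक", "२": "दो", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छह", "७": "सात", "८": "आठ", "९": "नौ",
}

# Assumption for illustration only: digits a landline number may start with.
ALLOWED_START_DIGITS = frozenset("२३४५६")


def restrict(mapping, allowed):
    """Keep only the entries whose key is in `allowed`; the words are reused unchanged."""
    return {digit: word for digit, word in mapping.items() if digit in allowed}


start_digits = restrict(CARDINAL_DIGITS, ALLOWED_START_DIGITS)
```

Because the words come from the shared table, a change to the cardinal data would carry over to the restricted view automatically.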

@@ -0,0 +1,8 @@
६ छह
Collaborator

is this mapping any different than cardinals (lines 1-4)

Contributor Author

These refer to the validation of mobile numbers starting with specific digits within India.

Collaborator

but is the mapping any different from cardinals? if not, please import from cardinals and restrict the accepted inputs

Contributor Author

I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.

@@ -0,0 +1,20 @@
० शून्य
Collaborator

is this mapping any different than cardinals?

Contributor Author

For Hindi digits, no: the mapping is the same as the cardinal single digits. For English digits, yes: this file serves as a common resource for the telephone class.

Collaborator

can you please use cardinal for Hindi digits and filter the inputs you need, and only add a file for English digits in that case? let's avoid repetition

Contributor Author

yes sure, I've updated the same
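One way to picture the agreed approach is to keep the Hindi digit words in a single shared list and attach both digit scripts to it, then read a number digit by digit. The sketch below is plain Python with hypothetical names (`DIGIT_WORDS`, `spell_digits`); the PR itself keeps this data in TSV files and builds pynini graphs, and deriving the English-digit table from the same list is just one possible way to avoid repetition. Country-code symbols such as "+" are left out because their verbalization is not shown in this thread.

```python
# Shared word list, indexed by digit value 0-9.
DIGIT_WORDS = ["शून्य", "एक", "दो", "तीन", "चार", "पाँच", "छह", "सात", "आठ", "नौ"]

# Both scripts point at the same words, so nothing is written twice.
HINDI_DIGITS = dict(zip("०१२३४५६७८९", DIGIT_WORDS))
ENGLISH_DIGITS = dict(zip("0123456789", DIGIT_WORDS))
ALL_DIGITS = {**HINDI_DIGITS, **ENGLISH_DIGITS}

SEPARATORS = {" ", "-"}  # assumption: separators are dropped when reading aloud


def spell_digits(number: str) -> str:
    """Read a digit string one digit at a time, accepting either script."""
    words = []
    for ch in number:
        if ch in SEPARATORS:
            continue
        if ch not in ALL_DIGITS:
            raise ValueError(f"unsupported character: {ch!r}")
        words.append(ALL_DIGITS[ch])
    return " ".join(words)
```

With this layout, `spell_digits("98")` and `spell_digits("९८")` produce the same string.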

@@ -0,0 +1,100 @@
० एक
Collaborator

is this mapping any different than cardinals (lines 1-4)?

Contributor Author

Yes. Paune (पौने) means a quarter less than the next whole number, so 0.75 is read relative to one, and zero is therefore mapped to one in paune_mappings.

Collaborator

we don't want a data file that is 100 lines -- please reuse cardinal when applicable or reapply with rules elsewhere
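To illustrate the reviewer's point, the offset that paune needs can be computed from an existing cardinal table rather than stored as a separate 100-line file. The following is a plain-Python sketch under that assumption, not the PR's pynini implementation; `paune_mapping` is a hypothetical helper and `cardinals` is a small stand-in for the real cardinal data.

```python
from typing import Dict


def paune_mapping(cardinals: Dict[int, str]) -> Dict[int, str]:
    """Map each integer part n to the word for n + 1, as needed to read n.75."""
    return {n: cardinals[n + 1] for n in cardinals if n + 1 in cardinals}


# Small stand-in for the cardinal table (integer -> Hindi word).
cardinals = {0: "शून्य", 1: "एक", 2: "दो", 3: "तीन"}
shifted = paune_mapping(cardinals)
```

Here `shifted[0]` is the word for one, which matches the `० एक` line under review, and the last entry of `cardinals` has no shifted counterpart because its successor is not in the table.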

ngachchi and others added 2 commits July 14, 2025 11:05
Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
    def __init__(self):
        super().__init__(name="telephone", kind="classify")

        mobile_number = generate_mobile(["नंबर", "मोबाइल", "फोन", "कॉल"])
Collaborator

can these inputs be part of a tsv file instead of hardcoding them here?

Contributor Author

yes sure, I've removed these inputs and converted them to respective tsv files
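A minimal sketch of what moving the keywords into a data file might look like is shown below. It uses only the standard library; `load_keywords` is a hypothetical helper, not a function from the repository, and an in-memory file stands in for a TSV that would live in the language's data folder.

```python
import csv
import io


def load_keywords(tsv_file):
    """Return the first column of every non-empty row in a TSV file object."""
    reader = csv.reader(tsv_file, delimiter="\t")
    return [row[0].strip() for row in reader if row and row[0].strip()]


# Stand-in for a keyword TSV; one keyword per line, as in the hardcoded list above.
sample = io.StringIO("नंबर\nमोबाइल\nफोन\nकॉल\n")
mobile_keywords = load_keywords(sample)
```

The tagger could then receive `mobile_keywords` in place of the literal list, so adding a keyword becomes a data change rather than a code change.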

Signed-off-by: Namrata Gachchi <ngachchi@nvidia.com>
github-actions bot commented Aug 8, 2025

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 8, 2025
This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Aug 16, 2025