Closed
Changes from 16 commits
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -27,7 +27,7 @@ pipeline {
     HY_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-0'
     MR_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/03-12-24-1'
     JA_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/10-17-24-1'
-    HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/04-22-25-0'
+    HI_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-25-25-0'
     DEFAULT_TN_CACHE='/home/jenkinsci/TestData/text_norm/ci/grammars/06-08-23-0'
   }
   stages {
@@ -4,9 +4,11 @@ h घंटे
min मिनट
doz दर्जन
yr साल
yr वर्ष
hp हॉर्सपॉवर
d दिन
month महीना
months महीने
हफ़्ते हफ़्ते
हफ़्ते
सप्ताह
सदियां
सदियों
@@ -134,7 +134,6 @@ KHz किलोहर्ट्ज़
 N न्यूटन
 dB डेसीबल
 yr साल
-yr वर्ष
 hp हॉर्सपॉवर
 d दिन
 month महीना
@@ -0,0 +1,13 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,2 @@
९१ नौ एक
91 नौ एक
@@ -0,0 +1,3 @@
नंबर
कार्ड
क्रेडिट
@@ -0,0 +1,5 @@
नंबर
मोबाइल
फोन
लैंडलाइन
कॉल
@@ -0,0 +1,8 @@
२ दो

Collaborator: Is this mapping any different from the cardinals (lines 1-4)?

Contributor Author: These entries are used to validate Indian landline numbers, which must start with specific digits.

Collaborator (@mgrafu, Jul 23, 2025): But is the mapping itself any different from cardinals? If not, please import it from cardinals and restrict the accepted inputs.

Contributor Author: I have removed the additional mappings from this TSV file and integrated them into the existing TSV files, as you suggested.

३ तीन
४ चार
६ छह
2 दो
3 तीन
4 चार
6 छह
@@ -0,0 +1,4 @@
नंबर
मोबाइल
फोन
कॉल
@@ -0,0 +1,8 @@
६ छह

Collaborator: Is this mapping any different from the cardinals (lines 1-4)?

Contributor Author: These entries are used to validate Indian mobile numbers, which must start with specific digits.

Collaborator: But is the mapping itself any different from cardinals? If not, please import it from cardinals and restrict the accepted inputs.

Contributor Author: I have removed the additional mappings from this TSV file and integrated them into the existing TSV files, as you suggested.
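
The request in these threads (reuse the cardinal mapping and restrict the accepted inputs) can be illustrated with a minimal plain-Python sketch. It is not the PR's pynini code: `CARDINAL_DIGITS` and `restrict` are hypothetical names, and the dictionary is only a stand-in for the existing cardinal digit data. In the grammar itself this would presumably be done by composing an acceptor for the allowed digits with the cardinal digit graph.

```python
# Stand-in for the shared cardinal digit mapping (assumption: the real data
# lives in the existing cardinal TSV files; this dict only mimics it).
HINDI_WORDS = ["शून्य", "एक", "दो", "तीन", "चार", "पाँच", "छह", "सात", "आठ", "नौ"]
DEVANAGARI_DIGITS = "०१२३४५६७८९"
ASCII_DIGITS = "0123456789"

CARDINAL_DIGITS = {
    digit: word
    for digits in (DEVANAGARI_DIGITS, ASCII_DIGITS)
    for digit, word in zip(digits, HINDI_WORDS)
}


def restrict(mapping, allowed):
    """Keep only the entries whose key is an allowed leading digit."""
    return {digit: word for digit, word in mapping.items() if digit in allowed}


# Leading digits taken from the two TSV files under review.
LANDLINE_START = restrict(CARDINAL_DIGITS, set("२३४६2346"))
MOBILE_START = restrict(CARDINAL_DIGITS, set("६७८९6789"))

if __name__ == "__main__":
    print(sorted(LANDLINE_START))
    print(sorted(MOBILE_START))
```

With this shape, the digit-to-word pairs exist in one place only, and each number type contributes nothing but its set of allowed leading digits.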

७ सात
८ आठ
९ नौ
6 छह
7 सात
8 आठ
9 नौ
@@ -0,0 +1,10 @@
0 शून्य
1 एक
2 दो
3 तीन
4 चार
5 पाँच
6 छह
7 सात
8 आठ
9 नौ
@@ -0,0 +1,4 @@
नंबर
पिन
कोड
पिनकोड
@@ -0,0 +1,100 @@
० एक

Collaborator: Is this mapping any different from the cardinals (lines 1-4)?

Contributor Author: Yes. A value ending in .75 is read as a quarter less than the next integer, so each number in paune_mappings maps to its successor (zero maps to one).

Collaborator: We don't want a data file that is 100 lines long. Please reuse cardinal where applicable, or implement the offset with rules elsewhere.
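
One way to read the reviewer's request is that the shifted table can be derived from the cardinals by rule instead of being stored. The following is a minimal plain-Python sketch of that idea, not the PR's implementation: `CARDINAL_WORDS`, `to_devanagari`, and `build_paune_mapping` are hypothetical names, and the cardinal table is truncated to ten entries for brevity.

```python
DEVANAGARI_DIGITS = "०१२३४५६७८९"

# Small stand-in for the cardinal words (assumption: the real grammar would
# take these from the existing cardinal data, not from a literal dict).
CARDINAL_WORDS = {
    1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच",
    6: "छह", 7: "सात", 8: "आठ", 9: "नौ", 10: "दस",
}


def to_devanagari(n: int) -> str:
    """Render a non-negative integer with Devanagari digits."""
    return "".join(DEVANAGARI_DIGITS[int(ch)] for ch in str(n))


def build_paune_mapping(cardinals):
    """Map each written integer n to the word for n + 1.

    This reproduces the successor shift of paune_mappings.tsv
    (० -> एक, १ -> दो, ...) without a separate data file.
    """
    return {to_devanagari(n - 1): word for n, word in cardinals.items() if n >= 1}


if __name__ == "__main__":
    for written, spoken in build_paune_mapping(CARDINAL_WORDS).items():
        print(written, spoken)
```

The generated pairs match the first rows of the file under review, so the 100-line table carries no information beyond the cardinals plus a fixed offset of one.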

१ दो
२ तीन
३ चार
४ पाँच
५ छह
६ सात
७ आठ
८ नौ
९ दस
१० ग्यारह
११ बारह
१२ तेरह
१३ चौदह
१४ पंद्रह
१५ सोलह
१६ सत्रह
१७ अठारह
१८ उन्नीस
१९ बीस
२० इक्कीस
२१ बाईस
२२ तेईस
२३ चौबीस
२४ पच्चीस
२५ छब्बीस
२६ सत्ताईस
२७ अट्ठाईस
२८ उनतीस
२९ तीस
३० इकतीस
३१ बत्तीस
३२ तैंतीस
३३ चौंतीस
३४ पैंतीस
३५ छत्तीस
३६ सैंतीस
३७ अड़तीस
३८ उनतालीस
३९ चालीस
४० इकतालीस
४१ बयालीस
४२ तैंतालीस
४३ चौवालीस
४४ पैंतालीस
४५ छियालीस
४६ सैंतालीस
४७ अड़तालीस
४८ उनचास
४९ पचास
५० इक्यावन
५१ बावन
५२ तिरेपन
५३ चौवन
५४ पचपन
५५ छप्पन
५६ सत्तावन
५७ अट्ठावन
५८ उनसठ
५९ साठ
६० इकसठ
६१ बासठ
६२ तिरेसठ
६३ चौंसठ
६४ पैंसठ
६५ छियासठ
६६ सड़सठ
६७ अड़सठ
६८ उनहत्तर
६९ सत्तर
७० इकहत्तर
७१ बहत्तर
७२ तिहत्तर
७३ चौहत्तर
७४ पचहत्तर
७५ छिहत्तर
७६ सतहत्तर
७७ अठहत्तर
७८ उनासी
७९ अस्सी
८० इक्यासी
८१ बयासी
८२ तिरासी
८३ चौरासी
८४ पचासी
८५ छियासी
८६ सत्तासी
८७ अट्ठासी
८८ नवासी
८९ नब्बे
९० इक्यानबे
९१ बानबे
९२ तिरानबे
९३ चौरानबे
९४ पंचानबे
९५ छियानबे
९६ सत्तानबे
९७ अट्ठानबे
९८ निन्यानबे
९९ एक सौ
16 changes: 10 additions & 6 deletions nemo_text_processing/text_normalization/hi/taggers/cardinal.py
@@ -21,12 +21,12 @@

 class CardinalFst(GraphFst):
     """
-    Finite state transducer for classifying cardinals, e.g.
-    -२३ -> cardinal { negative: "true" integer: "तेइस" } }
-    s
-    Args:
-    deterministic: if True will provide a single transduction option,
-    for False multiple transduction are generated (used for audio-based normalization)
+    Finite state transducer for classifying cardinals, e.g.
+        -२३ -> cardinal { negative: "true" integer: "तेइस" }
+
+    Args:
+        deterministic: if True will provide a single transduction option,
+            for False multiple transduction are generated (used for audio-based normalization)
     """

     def __init__(self, deterministic: bool = True, lm: bool = False):
@@ -37,6 +37,10 @@ def __init__(self, deterministic: bool = True, lm: bool = False):
         teens_ties = pynini.string_file(get_abs_path("data/numbers/teens_and_ties.tsv"))
         teens_and_ties = pynutil.add_weight(teens_ties, -0.1)

+        self.digit = digit
+        self.zero = zero
+        self.teens_and_ties = teens_and_ties
+
         def create_graph_suffix(digit_graph, suffix, zeros_counts):
             zero = pynutil.add_weight(pynutil.delete("०"), -0.1)
             if zeros_counts == 0:
@@ -58,9 +58,7 @@ class DecimalFst(GraphFst):
     def __init__(self, cardinal: GraphFst, deterministic: bool = True):
         super().__init__(name="decimal", kind="classify", deterministic=deterministic)

-        graph_digit = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
-        graph_digit |= pynini.string_file(get_abs_path("data/numbers/zero.tsv"))
-
+        graph_digit = cardinal.digit | cardinal.zero
         cardinal_graph = cardinal.final_graph

         self.graph = graph_digit + pynini.closure(insert_space + graph_digit).optimize()
37 changes: 34 additions & 3 deletions nemo_text_processing/text_normalization/hi/taggers/fraction.py
@@ -16,6 +16,7 @@
 from pynini.lib import pynutil

 from nemo_text_processing.text_normalization.hi.graph_utils import GraphFst
+from nemo_text_processing.text_normalization.hi.utils import get_abs_path


 class FractionFst(GraphFst):
@@ -47,13 +48,43 @@ def __init__(self, cardinal, deterministic: bool = True):
         )
         self.denominator = pynutil.insert("denominator: \"") + cardinal_graph + pynutil.insert("\"")

-        self.graph = (
+        dedh_dhai_graph = pynini.string_map([("१ १/२", "डेढ़"), ("२ १/२", "ढाई")])
+
+        savva_numbers = cardinal_graph + pynini.cross(" १/४", "")
+        savva_graph = pynutil.insert("सवा ") + savva_numbers
+
+        sadhe_numbers = cardinal_graph + pynini.cross(" १/२", "")
+        sadhe_graph = pynutil.insert("साढ़े ") + sadhe_numbers
+
+        paune = pynini.string_file(get_abs_path("data/whitelist/paune_mappings.tsv"))
+        paune_numbers = paune + pynini.cross(" ३/४", "")
+        paune_graph = pynutil.insert("पौने ") + paune_numbers
+
+        graph_dedh_dhai = pynutil.insert("morphosyntactic_features: \"") + dedh_dhai_graph + pynutil.insert("\" ")
+
+        graph_savva = pynutil.insert("morphosyntactic_features: \"") + savva_graph + pynutil.insert("\" ")
+
+        graph_sadhe = pynutil.insert("morphosyntactic_features: \"") + sadhe_graph + pynutil.insert("\" ")
+
+        graph_paune = pynutil.insert("morphosyntactic_features: \"") + paune_graph + pynutil.insert("\" ")
+
+        final_graph = (
             self.optional_graph_negative
             + pynini.closure(self.integer + pynini.accep(" "), 0, 1)
             + self.numerator
             + self.denominator
         )

+        weighted_graph = (
+            final_graph
+            | pynutil.add_weight(graph_dedh_dhai, -0.2)
+            | pynutil.add_weight(graph_savva, -0.1)
+            | pynutil.add_weight(graph_sadhe, -0.1)
+            | pynutil.add_weight(graph_paune, -0.2)
+        )
+
+        self.graph = weighted_graph
+
         graph = self.graph
-        final_graph = self.add_tokens(graph)
-        self.fst = final_graph.optimize()
+        graph = self.add_tokens(graph)
+        self.fst = graph.optimize()
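
The new branches in this hunk encode a handful of colloquial readings for mixed fractions. As a rough plain-Python restatement of those rules (a sketch for reviewing the intended behaviour, not the pynini grammar), assuming a hypothetical helper `colloquial_mixed_fraction` and a truncated stand-in table `CARDINALS`:

```python
from typing import Mapping, Optional

# Spoken forms taken from the hunk above.
DEDH = "डेढ़"    # 1 1/2
DHAI = "ढाई"    # 2 1/2
SAVVA = "सवा"   # n 1/4
SADHE = "साढ़े"  # n 1/2 (for n other than 1 and 2)
PAUNE = "पौने"   # n 3/4, spoken with the NEXT integer

# Truncated stand-in for the cardinal words (assumption, for illustration only).
CARDINALS = {
    1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच",
    6: "छह", 7: "सात", 8: "आठ", 9: "नौ", 10: "दस",
}


def colloquial_mixed_fraction(
    whole: int, num: int, den: int, cardinals: Mapping[int, str] = CARDINALS
) -> Optional[str]:
    """Return the colloquial reading of `whole num/den`, or None if no rule applies."""
    if (num, den) == (1, 2):
        # The dedicated words win over the generic "साढ़े n" pattern,
        # mirroring the lower weight given to graph_dedh_dhai.
        if whole == 1:
            return DEDH
        if whole == 2:
            return DHAI
        word = cardinals.get(whole)
        return f"{SADHE} {word}" if word else None
    if (num, den) == (1, 4):
        word = cardinals.get(whole)
        return f"{SAVVA} {word}" if word else None
    if (num, den) == (3, 4):
        # "A quarter less than the next integer": shift by one.
        word = cardinals.get(whole + 1)
        return f"{PAUNE} {word}" if word else None
    return None


if __name__ == "__main__":
    for args in [(1, 1, 2), (3, 1, 2), (4, 1, 4), (2, 3, 4)]:
        print(args, colloquial_mixed_fraction(*args))
```

The only rule that needs more than the plain cardinal is the 3/4 case, which is where the successor shift stored in paune_mappings.tsv comes in.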
55 changes: 50 additions & 5 deletions nemo_text_processing/text_normalization/hi/taggers/measure.py
@@ -41,8 +41,9 @@ def __init__(self, cardinal: GraphFst, decimal: GraphFst):
         super().__init__(name="measure", kind="classify")

         cardinal_graph = (
-            digit
-            | teens_and_ties
+            cardinal.zero
+            | cardinal.digit
+            | cardinal.teens_and_ties
             | cardinal.graph_hundreds
             | cardinal.graph_thousands
             | cardinal.graph_ten_thousands
@@ -52,6 +53,7 @@ def __init__(self, cardinal: GraphFst, decimal: GraphFst):
         point = pynutil.delete(".")
         decimal_integers = pynutil.insert("integer_part: \"") + cardinal_graph + pynutil.insert("\"")
         decimal_graph = decimal_integers + point + insert_space + decimal.graph_fractional
+
         unit_graph = pynini.string_file(get_abs_path("data/measure/unit.tsv"))
         quarterly_units_graph = pynini.string_file(get_abs_path("data/measure/quarterly_units.tsv"))

@@ -93,10 +95,50 @@ def __init__(self, cardinal: GraphFst, decimal: GraphFst):
             + unit
         )

-        graph_quarter = (
+        dedh_dhai = pynini.string_map([("१.५", "डेढ़"), ("२.५", "ढाई")])
+        dedh_dhai_graph = pynutil.insert("integer: \"") + dedh_dhai + pynutil.insert("\"")
+
+        savva_numbers = cardinal_graph + pynini.cross(".२५", "")
+        savva_graph = pynutil.insert("integer: \"सवा ") + savva_numbers + pynutil.insert("\"")
+
+        sadhe_numbers = cardinal_graph + pynini.cross(".५", "")
+        sadhe_graph = pynutil.insert("integer: \"साढ़े ") + sadhe_numbers + pynutil.insert("\"")
+
+        paune = pynini.string_file(get_abs_path("data/whitelist/paune_mappings.tsv"))
+        paune_numbers = paune + pynini.cross(".७५", "")
+        paune_graph = pynutil.insert("integer: \"पौने ") + paune_numbers + pynutil.insert("\"")
+
+        graph_dedh_dhai = (
+            pynutil.insert("cardinal { ")
+            + optional_graph_negative
+            + dedh_dhai_graph
+            + pynutil.insert(" }")
+            + delete_space
+            + units
+        )
+
+        graph_savva = (
+            pynutil.insert("cardinal { ")
+            + optional_graph_negative
+            + savva_graph
+            + pynutil.insert(" }")
+            + delete_space
+            + units
+        )
+
+        graph_sadhe = (
+            pynutil.insert("cardinal { ")
+            + optional_graph_negative
+            + sadhe_graph
+            + pynutil.insert(" }")
+            + delete_space
+            + units
+        )
+
+        graph_paune = (
             pynutil.insert("cardinal { ")
             + optional_graph_negative
-            + quarter_graph
+            + paune_graph
             + pynutil.insert(" }")
             + delete_space
             + units
@@ -135,9 +177,12 @@ def __init__(self, cardinal: GraphFst, decimal: GraphFst):

         graph = (
             pynutil.add_weight(graph_decimal, 0.01)
-            | pynutil.add_weight(graph_quarter, 0.005)
             | pynutil.add_weight(graph_cardinal, 0.01)
             | pynutil.add_weight(graph_exceptions, 0.01)
+            | pynutil.add_weight(graph_dedh_dhai, 0.001)
+            | pynutil.add_weight(graph_savva, 0.005)
+            | pynutil.add_weight(graph_sadhe, 0.005)
+            | pynutil.add_weight(graph_paune, -0.2)
         )
         self.graph = graph.optimize()
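
The weights in this union decide which branch wins when several of them accept the same input, because the lowest total weight is preferred when the shortest path is taken. A simplified plain-Python sketch of that selection follows. It is an illustration only: `BRANCH_WEIGHTS` and `pick_branch` are hypothetical names, and weights added inside the sub-graphs (for example the -0.1 on `teens_and_ties`) are ignored.

```python
# Top-level branch weights copied from the union in the hunk above.
BRANCH_WEIGHTS = {
    "decimal": 0.01,
    "cardinal": 0.01,
    "exceptions": 0.01,
    "dedh_dhai": 0.001,
    "savva": 0.005,
    "sadhe": 0.005,
    "paune": -0.2,
}


def pick_branch(matching_branches):
    """Return the matching branch with the lowest weight.

    Simplification: only the top-level weight of each branch is compared.
    """
    return min(matching_branches, key=BRANCH_WEIGHTS.__getitem__)


if __name__ == "__main__":
    # "१.५ <unit>" is accepted by the decimal, sadhe and dedh_dhai branches.
    print(pick_branch(["decimal", "sadhe", "dedh_dhai"]))
    # "२.७५ <unit>" is accepted by the decimal and paune branches.
    print(pick_branch(["decimal", "paune"]))
```

Under this simplification, the dedicated डेढ़/ढाई words beat the generic साढ़े pattern, and every colloquial branch beats the plain decimal reading.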
