Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier
Abstract
Recordings gathered with child-worn devices promised to revolutionize both fundamental and applied speech sciences by allowing the effortless capture of children's naturalistic speech environment and language production. This promise hinges on speech technologies that can transform the sheer mounds of data thus collected into usable information. This paper demonstrates several obstacles blocking progress by summarizing three years' worth of experiments aimed at improving one fundamental task: Voice Type Classification. Our experiments suggest that improvements in representation features, architecture, and parameter search contribute to only marginal gains in performance. More progress is made by focusing on data relevance and quantity, which highlights the importance of collecting data with appropriate permissions to allow sharing.
This repository contains the scripts needed to train the Whisper-VTC models, perform inference on a set of audio files, and evaluate the models against ground-truth annotations.
Ensure that you have uv installed on your system.
Clone the repo and setup dependencies:
git clone git@github.com:LAAC-LSCP/VTC-IS-25.git
cd VTC-IS-25
uv sync
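To check that the environment resolved correctly, you can try importing segma from within the project environment (this assumes segma, which is used throughout this README, is declared as a dependency and importable under that name):
uv run python -c "import segma"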
The audio files for inference simply need to be placed in a single folder; the inference script will load them automatically.
For training, the data is loaded in a specific way using segma and its SegmaFileDataset class. In short, the data needs to have the following structure:
dataset_name/
├── rttm/
│   ├── 0000.rttm
│   └── 0001.rttm
├── uem/ (optional)
│   ├── 0000.uem
│   └── 0001.uem
├── wav/
│   ├── 0000.wav
│   └── 0001.wav
├── train.txt
├── val.txt
├── test.txt
└── exclude.txt (optional)
Here, train.txt, val.txt and test.txt are lists of unique identifiers linking each audio file (.wav) to its annotation (.rttm).
The structure of an RTTM or a UEM file is best described in pyannote-database.
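For illustration only (the file identifier and the voice-type labels below are made up; the exact label set depends on the trained model), an RTTM file contains one SPEAKER line per annotated segment, with the file identifier, channel, onset, duration and label as the meaningful fields, while a UEM file lists the regions to be scored as identifier, channel, start and end times:
SPEAKER 0000 1 12.34 1.20 <NA> <NA> KCHI <NA> <NA>
SPEAKER 0000 1 15.00 2.50 <NA> <NA> FEM <NA> <NA>
0000 1 0.00 600.00
The matching entry in train.txt, val.txt or test.txt would then simply be the identifier 0000 on its own line.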
Before training or running inference, you will need to download the weights of the pre-trained Whisper small model using the save_load_whisper.py script.
uv run scripts/save_load_whisper.py --model small
The config.yml file contains most of the variables needed for training. Refer to segma for more information on training a model.
uv run scripts/train.py \
--config config/config.yml
Inference is done using a model checkpoint, the corresponding config file used for training, and the folder of audio files to run inference on.
uv run scripts/infer.py \
--config model/config.yml \
--wavs audios \
--checkpoint model/best.ckpt \
--output predictions
Simply specify the input folder and output folder.
For more fine-grained tuning, use the min-duration-on-s and min-duration-off-s parameters.
uv run scripts/merge_segments.py \
--folder rttm_folder \
--output rttm_merged
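As an illustration, and assuming the two thresholds above are exposed as command-line flags of the same name (check the script's help to confirm), a call with explicit values might look like:
uv run scripts/merge_segments.py \
--folder rttm_folder \
--output rttm_merged \
--min-duration-on-s 0.1 \
--min-duration-off-s 0.5
Following the pyannote.audio convention referenced below, min-duration-off-s would control how short a gap between two segments of the same label may be before they are merged, and min-duration-on-s how short a resulting segment may be before it is dropped.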
To perform inference and speech-segment merging (see merge_segments.py for help, or this pyannote.audio description), a single bash script is provided.
Simply set the correct variables in the script and run it:
sh scripts/run.sh
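With the variables set, running the script is essentially equivalent to chaining the two commands documented above (the paths are the same illustrative ones):
uv run scripts/infer.py \
--config model/config.yml \
--wavs audios \
--checkpoint model/best.ckpt \
--output predictions
uv run scripts/merge_segments.py \
--folder predictions \
--output predictions_merged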
@misc{kunze2025challengesautomatedprocessingspeech,
title={Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier},
author={Tarek Kunze and Marianne MΓ©tais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
year={2025},
eprint={2506.11074},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2506.11074},
}
This work uses the segma library, which is heavily inspired by pyannote.audio.
The first version of the Voice Type Classifier is available here.