ASR for Extremely Low-Resource Indic Languages


Objectives

To build ASR systems for extremely low-resource Indic languages. We will focus on three highly under-resourced Indian languages with no open speech datasets: Konkani, Maithili, and Santali.

Details

Automatic Speech Recognition (ASR):

  • Has made impressive strides in recent years and is seeing widespread adoption in various applications.
  • Existing ASR systems require large amounts of labeled speech in order to perform competitively, creating a dichotomy between high-resource languages (with access to large labeled speech corpora) and low-resource or under-resourced languages (with minimal access to labeled speech data).
  • Most Indian languages are, unfortunately, low-resource with respect to ASR technologies.
  • A recent study of open voice data in Indian languages reveals that the top ten Indian languages by amount of labeled speech average less than 250 hours each; the remaining Indian languages have little to no accompanying labeled speech.
  • Focuses on three highly under-resourced Indian languages with no open speech datasets:
    • Konkani (Indo-Aryan)
    • Maithili (Indo-Aryan)
    • Santali (Austroasiatic)
  • According to the 2011 census,
    • Konkani has around 2.2 million native speakers
    • Santali has 7.3 million native speakers 
    • Maithili has 13.5 million native speakers
  • These languages have native speakers hailing from multiple Indian states, including Goa, Bihar, Jharkhand, Karnataka, Maharashtra, Assam, Mizoram, Odisha, Tripura and West Bengal.

Data Characteristics:

  • Unlike almost all existing Indian-language speech datasets, which focus on read speech, we will aim to collect interview-style speech from native speakers, resulting in more spontaneous speech for all three languages.
  • While prompt-based data collection is commonly employed to collect read speech, it can create challenges: prompts created by speakers of one dialect may be unfamiliar to speakers of a different dialect of the same language.
  • Interview-style speech data collection allows us to bypass such concerns and also potentially collect speech in many dialects. 
  • Spontaneous speech is also rarely represented in existing speech datasets for Indian languages, making it a rich resource.
  • For data collection, we will also investigate crowdsourcing speech from native speakers via CLAP, an Android app developed by one of the PIs as part of an IMPRINT-2 grant.

Aim:

  • To collect 300 hours of transcribed conversational speech in each language. (As a frame of reference, Switchboard, the most popular conversational speech dataset for English, consists of roughly 300 hours of speech.)
  • Will focus on the agricultural domain for all three languages.
  • Evaluate our target languages using existing pretrained multilingual models (e.g., XLSR); a minimal evaluation sketch follows this list.
  • Develop:
    • Adaptation techniques to finetune the pretrained models with labeled data in the target languages (a fine-tuning sketch also follows this list).
    • Self-supervision techniques that help the models perform better on conversational-style speech in our target languages.
    • Transfer learning approaches that help utilize conversational speech in one target language for the other target languages.
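
As an illustration of the evaluation step above, the sketch below transcribes one recording with an XLSR-style model and scores it with word error rate (WER), using the Hugging Face transformers library, torchaudio, and jiwer. The checkpoint name our-org/xlsr53-ctc-maithili and the audio file name are hypothetical placeholders: the raw XLSR-53 encoder has no CTC output head, so a checkpoint already fine-tuned on some labeled target-language data is assumed.

```python
import torch
import torchaudio
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical checkpoint: XLSR-53 after CTC fine-tuning on labeled
# target-language speech (the raw encoder alone cannot transcribe).
MODEL_ID = "our-org/xlsr53-ctc-maithili"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

def transcribe(wav_path: str) -> str:
    """Greedy CTC decoding of one recording, downmixed and resampled to 16 kHz."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = waveform.mean(dim=0)  # stereo -> mono
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# Score one hypothetical interview clip against its gold transcript.
hypothesis = transcribe("maithili_interview_001.wav")
reference = "..."  # gold transcript for the same clip
print("WER:", wer(reference, hypothesis))
```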
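
For the adaptation item, one minimal sketch of the now-standard recipe is shown below: attach a freshly initialized CTC output head, sized to the target language's character vocabulary, on top of the pretrained XLSR-53 encoder, and fine-tune on labeled speech. The processor checkpoint our-org/xlsr53-konkani-vocab is a hypothetical placeholder whose tokenizer would be built from the project's Konkani transcripts; batching, learning-rate scheduling, and evaluation loops are omitted.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical processor whose character vocabulary was built from the
# target-language transcripts collected in this project.
processor = Wav2Vec2Processor.from_pretrained("our-org/xlsr53-konkani-vocab")

# Attach a fresh CTC head (sized to the target vocabulary) on top of the
# pretrained multilingual encoder.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(processor.tokenizer),
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_step(waveforms, transcripts):
    """One CTC fine-tuning step on a batch of (16 kHz array, text) pairs."""
    inputs = processor(waveforms, sampling_rate=16_000, padding=True,
                       return_attention_mask=True, return_tensors="pt")
    labels = processor(text=transcripts, padding=True,
                       return_tensors="pt").input_ids
    # The CTC loss ignores label positions set to -100.
    labels = labels.masked_fill(labels == processor.tokenizer.pad_token_id, -100)
    loss = model(input_values=inputs.input_values,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Freezing the convolutional feature encoder is a common choice in this recipe, since its low-level acoustic features tend to transfer across languages, leaving the transformer layers and the new CTC head to adapt to the target language.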

Deliverables:

  • Curated datasets consisting of around 300 hours of conversational speech each in Konkani, Maithili and Santali.
  • Web APIs for domain-specific ASR in all three languages, serving as initial prototypes (a minimal sketch follows below).
  • Android apps catering to each domain-specific ASR, to be further refined with user feedback.
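
A first prototype of such a Web API could simply wrap the transcribe() helper from the evaluation sketch above in a small HTTP service. The FastAPI snippet below is one illustrative option rather than a committed design; the route name and service title are assumptions.

```python
import shutil
import tempfile

from fastapi import FastAPI, File, UploadFile

# Reuses the hypothetical transcribe() helper from the evaluation sketch.
app = FastAPI(title="Konkani agricultural-domain ASR (prototype)")

@app.post("/transcribe")
async def transcribe_endpoint(audio: UploadFile = File(...)):
    # Persist the uploaded clip to a temporary WAV file, then decode it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        shutil.copyfileobj(audio.file, tmp)
        path = tmp.name
    return {"transcript": transcribe(path)}
```

Served with, e.g., uvicorn server:app, this gives each Android app a single endpoint to post recordings to.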

Organisations

   Indian Institute of Science, Bangalore

   Indian Institute of Technology, Bombay