Assistive Speech Technologies


Objectives

Assistive technology enables people to live independently and helps them work around challenges to learn, communicate, and function better. Such technology can substantially elevate the quality of living in every dimension.

Details

Technical details:

  • Assistive technology (AT) enables people to live independently and helps people work around challenges to learn, communicate, and function better.
  • The technology significantly elevates the quality of living in every dimension. 
  • The impact of technology on assistive devices has been profound in recent years. 
  • Assistive technology promotes greater independence for people with disabilities by enabling them to accomplish tasks that are difficult or impossible otherwise. 
  • Assistive technologies involve providing universal access, such as modifications to appliances to make them accessible to visually challenged persons or to persons with speech and hearing impairments. 
  • An important sub-discipline within the AT research community is Augmentative and Alternative Communication (AAC), which focuses on communication technologies for those with impairments that interfere with some aspect of human communication, including spoken or written modalities.
  • Speech technologies can be effectively used in AT/AAC in a large variety of ways, such as 
    • Improving (or enhancing) the intelligibility of unintelligible speech of speakers with disorders
    • Providing communicative assistance devices for individuals with severe motor disorders, such as 
      • Cerebral palsy,
      • Parkinson’s disease, 
      • Autism spectrum disorder.
  • The major requirement for all these technology developments is an annotated speech database of speakers affected by various speech disorders. 
  • In the Indian context, as there are not many publicly available disordered speech databases, the major objective of this research is to develop a speech database for articulatory disorder, specifically a dysarthric speech database for Tamil (Dravidian language), Hindi (Indo-Aryan language), and Indian English. 
  • Further, infant cry and children’s speech databases will be developed for one of the prevalent neurodevelopmental disorders, namely, Autism Spectrum Disorder (ASD). 
  • Cry episodes are considered because they are expected to provide evidence for early detection of ASD. 
  • Furthermore, the severity of the disorder and speech communicative disabilities can be analyzed, and appropriate therapy can be suggested for children with ASD. 
  • Towards this end, a children’s (aged 6 to 10 years) speech database will be developed for the languages Tamil and Hindi.

Major objectives:

  • To develop an annotated dysarthric speech database in Tamil, Hindi, and Indian English, with 20 speakers per language having varying degrees of the disorder.
  • To develop assessment and therapy tools for speech articulatory disorders
  • To develop ASR systems for dysarthric speakers
  • To develop TTS with adaptive speech rate for people with articulation disorders and for the visually challenged.
  • To develop an infant cry database for early detection of ASD and pathology classification.
  • To develop children’s speech database for ASD and articulation disorders
  • To develop children’s speech database for ASR with reference to consortium effort
  • To develop speech-based emotion recognition for disordered speech, such as dysarthria.

A. Assistive Technologies for Speakers with Articulatory Speech Disorders:

  • Speech is the natural form of verbal communication for human beings.
  • The process of speech production involves a phonetic plan and a motor plan.
  • The phonetic plan refers to the formulation of the message in the brain, whereas the motor plan refers to the process of uttering the formulated message using speech organs.
  • Any difficulty along the motor plan of speech production creates a condition called speech disorder.
  • This dysfunction not only inhibits a person from expressing their thoughts, needs, and desires but also limits their opportunities in education, employment, and recreation.
  • As per the survey conducted by the Government of India in 2011, speech disorder is listed as the fifth-highest disability, with a prevalence rate of 7.5% (Government of India). Among the spectrum of speech disorders, articulation disorders were found to have a higher occurrence rate of 18.7%.
  • Articulation disorder is caused by improper control in the motor plan, leading to imprecise or irregular production of speech sound units and, consequently, unintelligible speech.
  • Dysarthria, a neuromotor speech articulatory disorder caused by cerebral palsy or stroke, leads to an imprecise or irregular production of speech sound units.
  • Dysarthric speakers inadvertently substitute, insert, or delete sound units while speaking, making their speech unintelligible to the listeners.
  • The symptoms of dysarthria can range from mild slurring of speech sounds to complete inability to produce any intelligible words.
  • Based on the degree of severity, dysarthria can be classified as mild, moderate, moderate-to-severe, and severe.
  • Each dysarthric speaker is treated based on the cause, type, and severity of the disorder.
  • Since dysarthria is progressive (Cucchiara et al. 2020), early interventions, through speech therapies delivered by a speech-language therapist (SLT), can provide effective ways for speakers to regain intelligible speech.
  • However, the efficacy of these treatments is dependent on the severity of the disorder and therefore, the outcome is subjective.
  • To provide objective and measurable therapy solutions, an automatic assessment tool using speech-based technologies will be developed.
  • To develop any assessment, therapy or assistive technologies for these speakers, the availability of a speech database is the foremost requirement.

1. Development of Speech Database for Speakers with Articulatory Speech Disorders:

  • Dysarthria, a prevalent articulatory speech disorder, can be classified, based on the quality of the speech being rendered, into 
    • Spastic
    • Hyperkinetic
    • Hypokinetic 
    • Ataxic
    • Flaccid
    • Mixed.
  • Of these, the most prevalent types of dysarthria are spastic and ataxic.
  • Dysarthria might affect any of the speech subsystems (laryngeal, velopharyngeal or articulatory) of speech production, making the speech unintelligible.
  • Depending on the type and severity of the disorder, dysarthric individuals substitute, delete, or insert speech sound units.
  • In order to analyze these characteristics of dysarthria and to develop therapy and speech-enabled assistive technologies, an annotated speech database is a major requirement.
  • To that effect, the following steps are involved in developing a disordered speech database.
    • Identification of disordered individuals with varying degrees of disorder
    • Ethical clearance from the organization and parents
    • Formulation of text with enough phone and word examples to build assistive devices
    • Data recording
    • Data preprocessing
    • Intelligibility assessment by speech-language pathologists (SLPs)
    • Data annotation (segmentation and labeling at phone, word and sentence level)
  • Aim:   
    • Develop a speech database in Tamil (Dravidian language), Hindi (Indo-Aryan language), and Indian English.
  • Speech data will be collected from 20 speakers in each language with varying degrees of disorder.
  • Around 30 minutes of speech data (including words, phrases, and sentences of around five words at most) will be collected from each speaker.
  • Further, data annotation will be carried out at the phone, word, and sentence levels for further analysis (a sketch of one possible annotation format follows this list).
  • The annotated data will be available for public use for research and academic purposes.
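
As an illustration, the following minimal Python sketch shows one possible annotation format (our illustrative choice, not the project's mandated one): phone- and word-level intervals written as a Praat-style TextGrid, which most annotation tools can read. The tier names, times, and labels are hypothetical placeholders.

```python
# A minimal sketch of phone/word-level annotation storage as a Praat-style
# (long-format) TextGrid. All times and labels below are hypothetical.
def textgrid(tiers, xmax):
    """tiers: {tier_name: [(start_s, end_s, label), ...]}"""
    out = ['File type = "ooTextFile"', 'Object class = "TextGrid"', "",
           "xmin = 0", f"xmax = {xmax}", "tiers? <exists>",
           f"size = {len(tiers)}", "item []:"]
    for i, (name, intervals) in enumerate(tiers.items(), 1):
        out += [f"    item [{i}]:", '        class = "IntervalTier"',
                f'        name = "{name}"', "        xmin = 0",
                f"        xmax = {xmax}",
                f"        intervals: size = {len(intervals)}"]
        for j, (a, b, label) in enumerate(intervals, 1):
            out += [f"        intervals [{j}]:", f"            xmin = {a}",
                    f"            xmax = {b}", f'            text = "{label}"']
    return "\n".join(out)

# Hypothetical word with its phone segmentation.
print(textgrid({"word": [(0.0, 0.42, "appa")],
                "phone": [(0.0, 0.18, "a"), (0.18, 0.30, "pp"), (0.30, 0.42, "a")]},
               0.42))
```

Equivalent information can of course be stored in simple label files; the TextGrid format is used here only because Praat is widely used by annotators.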

2. An automatic speech assessment and therapy tool:

  • To develop assistive speech devices for disordered speakers using various speech technologies such as ASR and TTS, understanding the variations in speech characteristics of disordered speech is an important aspect.
  • When acoustic knowledge and articulatory characteristics are incorporated into the trained models, the performance of the system is expected to improve.
  • Towards this end, an automatic speech assessment system will be developed.
  • Proposed Research:
    • To understand the speech characteristics of dysarthria, based on the severity, a speech subsystem-based acoustic analysis will be carried out.
    • Variations in temporal and spectral characteristics will be analyzed thoroughly.
    • From the acoustic analysis, pronunciation (articulatory) errors based on place and manner of articulation can be identified.
    • Further, an ASR system trained using normal speakers’ speech data (consortium effort) with the minimum possible error rate and tested with dysarthric speech can be used as an assessment tool, as the acoustic deviation between normal and dysarthric speech is expected to be reflected in the likelihood space.
    • Acoustic similarity-based metrics, such as Kullback-Leibler (KL) divergence, can be used to differentiate recognition errors (modelling errors) from articulatory errors (see the sketch at the end of this section).
    • The articulatory knowledge thus derived can be exploited in reducing the WER of severity-specific dysarthric ASR systems while building an assistive device.
    • Acoustic-to-articulatory inversion (using an electromagnetic articulography (EMA) dataset) can be performed as well to understand the articulatory behaviour of the speaker.
    • Furthermore, a ResNet model can be trained to categorize the disordered speakers based on their severity.
  • Based on the severity of the disordered speaker, the number of phones in error varies.
  • If a tool (web application or a stand-alone system) is developed for identifying and indicating articulatory errors automatically, it can be used as a learning module by the disordered speakers.
  • Deviations in pronunciation with reference to normal speech articulation can be indicated.
  • A likelihood ratio-based scoring system, showing the degree of variation in pronunciation for known utterances, can act as a therapy tool that scores articulatory errors and helps the speaker correct their pronunciation.
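
To make the acoustic similarity idea concrete, the following minimal Python sketch fits diagonal Gaussians to MFCC frames of a normal and a dysarthric rendition of the same utterance and computes a symmetric KL divergence between them. MFCC features, diagonal-Gaussian modelling, and the file names are our illustrative assumptions, not the project's implementation.

```python
# A minimal sketch: symmetric KL divergence between frame-level MFCC
# distributions of normal vs. dysarthric renditions of the same utterance.
import numpy as np
import librosa

def mfcc_frames(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, dims)

def gaussian_kl(mu0, var0, mu1, var1):
    # KL(N0 || N1) for diagonal-covariance Gaussians, summed over dimensions.
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(feats_a, feats_b):
    mu_a, var_a = feats_a.mean(0), feats_a.var(0) + 1e-8
    mu_b, var_b = feats_b.mean(0), feats_b.var(0) + 1e-8
    return gaussian_kl(mu_a, var_a, mu_b, var_b) + gaussian_kl(mu_b, var_b, mu_a, var_a)

# Hypothetical files; a larger score suggests a larger acoustic deviation.
score = symmetric_kl(mfcc_frames("normal_utt.wav"), mfcc_frames("dysarthric_utt.wav"))
print(f"Symmetric KL divergence: {score:.2f}")
```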

3. ASR systems for dysarthric speakers

  • As articulatory error increases with severity, speech intelligibility across classes decreases.
  • Though the articulatory errors made by disordered speakers are speaker-specific, training a speaker-dependent ASR system for each of the speakers will be cumbersome and data-dependent.
  • However, on average, the characteristics of these speakers can be generalized based on their severity.
  • Therefore, if the severity of the speaker is known, severity-class-based ASR systems can be trained instead of per-speaker customized systems.
  • Proposed research (to achieve this, the following techniques will be explored):
    • ResNet-based severity classification systems will be explored to identify the category (mild, moderate, severe) to which the disordered speaker belongs.
    • Fixed and context-dependent substitution errors made by dysarthric speakers are more predominant than deletion and insertion errors.
    • These errors identified using the assessment tool can be corrected using various methods, such as text processing (finite state transducers) / biphone language models.
    • These methods will be explored in training ASR systems.
    • Since disordered speech has very limited training data, and collecting large amounts of speech data may cause fatigue to the speaker and is time-consuming as well, it is proposed to use multiresolution feature extraction and virtual microphone array synthesis as data augmentation techniques.
    • Further, it is also proposed to use CycleGAN for data augmentation tasks for dysarthric ASR.
    • Following the recent success of Wav2vec 2.0 for low-resource ASR, in particular its competitive performance with roughly 100 times less labelled training data, we plan to investigate its use for dysarthric ASR (see the sketch below).
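
A minimal fine-tuning sketch, assuming the HuggingFace transformers API and the publicly released facebook/wav2vec2-base-960h checkpoint, is given below; the dysarthric waveforms and transcripts are hypothetical placeholders, and the real setup would need a tokenizer matched to the target language.

```python
# A minimal sketch of fine-tuning a pre-trained wav2vec 2.0 model for
# dysarthric ASR with CTC, assuming the HuggingFace transformers API.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# With very little dysarthric data, freeze the convolutional feature encoder
# and fine-tune only the transformer layers and the CTC head.
model.freeze_feature_encoder()

def ctc_loss(waveforms, transcripts, sr=16000):
    # waveforms: list of 1-D numpy arrays at 16 kHz; transcripts: list of strings.
    inputs = processor(waveforms, sampling_rate=sr, return_tensors="pt", padding=True)
    labels = processor(text=transcripts, return_tensors="pt", padding=True).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding in the loss
    return model(input_values=inputs.input_values, labels=labels).loss

# One (hypothetical) optimization step:
# loss = ctc_loss([wave1, wave2], ["TEXT ONE", "TEXT TWO"]); loss.backward(); optimizer.step()
```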

4. Development of Speaker-Specific TTS for Articulatory Disorders and TTS with Adaptive Speech Rate for Visually Challenged

  • Augmentative and alternative speech assistive devices that handle disordered speech as input and provide intelligible speech as output are considered high-tech AAC devices.
  • These AAC devices require an ASR system, a text-processing module, and a TTS system that can produce speech in a disordered speaker’s own voice.
  • To retain the identity of the dysarthric speaker, in TTS, speaker-adaptation / voice conversion techniques can be carried out.
  • The speech rate of a dysarthric speaker varies across severity. 
  • Jitter and shimmer may be observed in a few speakers as well.
  • Speech rate analysis across different classes of dysarthric speakers will help in modifying the duration models in TTS to improve intelligibility.
  • A speech rate/prosody modification system as a post-processing module can be introduced to improve intelligibility.
  • As part of the MeitY TTS Phase-II Consortium, efforts have been made to integrate screen readers for the visually challenged, in particular, the integration of NVDA/ORCA with TTS.
  • However, it was observed that most visually challenged people wanted to hear speech at different speeds (ideally with adaptively changing speed).
  • To that effect, we propose to use signal processing methods such as pitch, tempo, and time-scale modification (a sketch follows this list).
  • Given the potential of GANs for voice conversion, the relevance of GANs for TTS will be explored.
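
A minimal sketch of such rate control, assuming the librosa and soundfile libraries, is given below; the file names and speed factor are hypothetical, and in practice the factor would be chosen adaptively per user or severity class rather than hard-coded.

```python
# A minimal sketch of speech-rate control via time-scale modification
# (tempo change with pitch preserved) plus optional pitch shifting.
import librosa
import soundfile as sf

def change_rate(in_path, out_path, speed=1.0, semitones=0.0, sr=16000):
    y, sr = librosa.load(in_path, sr=sr)
    y = librosa.effects.time_stretch(y, rate=speed)          # tempo change, pitch preserved
    if semitones != 0.0:
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)
    sf.write(out_path, y, sr)

# Hypothetical usage: slow TTS output to 80% speed for a listener who prefers it.
change_rate("tts_output.wav", "tts_output_slow.wav", speed=0.8)
```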

 

B. Assistive Technologies for Autism Spectrum Disorder (ASD) Individuals

  • Impairments in social communication and social skills have been considered core symptoms of autism spectrum disorder (ASD).
  • In particular, ASD is a neurodevelopmental disorder that affects approximately 1 in 68 individuals, and it is characterized by a deficit in social communication and restricted, repetitive behaviour.
  • Abnormal speech patterns are among the major characteristics of autism.
  • Children with autism are often non-verbal when initially diagnosed.
  • If the individuals with ASD are verbal, the uttered speech is highly deviant and has a limited communication function.
  • The speech of many children with autism appears abnormal and is often described as machine-like, “monotonic,” or “sing-song.”
  • The abnormalities were even noted in early descriptions of autism.
  • However, their exact characteristics, underlying mechanisms, consistency, and diagnostic power have not yet been established.
  • Earlier studies on abnormal speech patterns focused on prosody or abnormal supra-segmental aspects of speech production. 
  • Further studies have been conducted to distinguish delayed speech acquisition by developing speech spectrum-based measures, such as the long-term average spectrum (a computation sketch follows this list). 
  • Some of the areas where studies have been conducted include: 
    • Long-term average spectrum analysis
    • Pitch analysis
    • Noise-level estimation
    • Spectral variability.
  • A number of abnormal speech patterns in autism have been identified, including 
    • Echolalia
    • Pronoun reversal
    • Metaphorical language
    • Poor grammatical structure
    • Atonality 
    • Arrhythmia.
  • Studies also have shown that abnormal speech patterns in autism are reflected in pitch variability and spectral content.
  • The variability may be an indication of abnormal processing of auditory feedback or instability in the mechanisms that control pitch.
  • Autistic children tend to repeat certain words and phrases over and over.
  • These phrases are often quite basic in nature.
  • Many people with autism are unable to label objects or use and understand abstract concepts until quite late, if at all.
  • People diagnosed with autism normally use idiosyncratic speech that makes little sense to those who are not familiar with them.
  • These individuals also use odd tones; their speech may be monotonous or marked by rises at the ends of sentences.
  • They may also use irregular intonation, pitch, pace, rhythm, and articulation.
  • Some of them also find it difficult to modulate their volume, irrespective of where they are.
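
To make the long-term average spectrum (LTAS) measure concrete, the following minimal Python sketch, assuming librosa and a hypothetical file name, averages the short-time power spectrum over all frames of an utterance; the resulting curves can then be compared across speaker groups.

```python
# A minimal sketch of the long-term average spectrum (LTAS): the power
# spectrum averaged over all analysis frames of an utterance, in dB.
import numpy as np
import librosa

def ltas_db(path, sr=16000, n_fft=1024, hop=256):
    y, sr = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2  # (freq, frames)
    mean_power = S.mean(axis=1)                                    # average over time
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    return freqs, 10.0 * np.log10(mean_power + 1e-12)

# Hypothetical usage: inspect spectral balance of a child's utterance.
freqs, curve = ltas_db("child_utterance.wav")
```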

5. To Develop an Infant Cry Database for Early Detection and Analysis of ASD and Pathology Classification

  • Many of the core criteria used to diagnose ASD fully emerge only after the first two years of life, making early identification/diagnosis challenging.
  • Even after appropriate diagnosis and treatment, many individuals with ASD who develop high functioning autism have persistent difficulties in aspects of pragmatics and vocal communication, despite years of empirically supported intervention targeting expressive and receptive language and communication. 
  • Moreover, there is no biological test or medical test that can identify ASD.
  • It is proposed to develop an infant cry database for early detection and analysis of ASD and other pathological conditions, such as asthma, asphyxia, deafness, and an underdeveloped larynx. 
  • Moreover, asthma and asphyxia are very prevalent in the Indian context. 
  • Collecting an infant cry database requires permission from the hospital authorities and parents as well. 
  • Getting cry signals of infants suffering from a pathology is even more difficult. 
  • Getting a statistically significant infant cry corpus is a challenging task. 
  • Most of the researchers working in this area have their own databases, which differ in 
    • sets of infant cry types
    • recording conditions
    • microphones
    • age groups
    • pathologies 
    • weights of infants. 
  • In pathological cry analysis, cry characteristics change with the severity of the disease (pathology). 
  • In such cases, long-term follow-up of the infant is required, which is a time-consuming and difficult task. 
  • All these factors together pose a challenge to researchers. 
  • During infant cry data collection, the following aspects should be considered:
    • The participation of infants must be voluntary in nature. No one can be forced to be a part of this study.
    • Before participation in the data collection, parents must be informed about the purpose of the study and the method of data collection. Written consent must be obtained from the parents.
    • Data should be collected in the presence of medical practitioners to get their feedback on the research work for normal vs pathological cases.
    • The cry utterance should be long enough to allow study of the reason for crying.
    • In the study of pathological cries, the reason for crying should be the same across recordings. 
    • The reason for crying itself introduces changes in acoustic features, in addition to the changes caused by the presence of pathology.
  • Forcing air through the vocal tract and over the larynx produces a cry. 
  • The process of controlling the air passing through the larynx is regulated, through the cranial nerves, by the brainstem and limbic system, functions of which are thought to be compromised in individuals with ASD. 
  • Assessment of the spectrographic characteristics of crying can give investigators important information about the function of those brain areas that are involved in the pathogenesis of ASD.
  • Ten distinct cry modes, reflecting patterns of variation of the fundamental frequency (F0, or pitch) and its harmonics, can be identified from the narrowband spectrogram of a normal infant cry (as shown in Fig. 1). 
  • Anomalies in the F0 during a crying episode, such as dysphonation and hyperphonation in children with ASD, are expected to be significant in detecting the disorder earlier. 
  • To carry out this research, signal processing-based approaches using narrowband spectrographic representations showing horizontal striations can be used.
  • It is proposed to explore the significance of the Constant-Q Transform (CQT) and its cepstral representation, i.e., Constant-Q Cepstral Coefficients (CQCC), in infant cry classification and their relation with early warning signs of ASD (see the sketch after this list). 
  • Further, the classification of other pathological conditions, such as asphyxia, asthma, and speech disorders due to cerebral palsy, will be explored via infant cry analysis as an early warning assessment tool.
  • The Constant-Q Transform has a variable spectro-temporal resolution in the time-frequency plane.
  • In addition, the analysis window function used in CQT is a function of both time and frequency parameters; hence, it helps to achieve the form-invariance property, a desirable attribute of feature descriptors for pattern classification in the spectral domain, which cannot be achieved with the traditional Short-Time Fourier Transform (STFT).
  • Thus, cry modes derived from the traditional spectrogram will not obey the form-invariance property in the spectral domain and hence may not be effective for infant cry classification tasks.
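
The following minimal Python sketch, assuming librosa and a hypothetical file name, approximates CQCC as the DCT of the log-power CQT; the uniform resampling step of the full CQCC recipe is omitted for brevity, so this is an illustrative approximation rather than the canonical extractor.

```python
# A minimal sketch of CQT-based features for infant cry classification:
# constant-Q cepstral coefficients (CQCC), approximated as the DCT of the
# log-power constant-Q transform (uniform resampling omitted for brevity).
import numpy as np
import librosa
from scipy.fftpack import dct

def cqcc(path, sr=16000, n_bins=96, bins_per_octave=24, n_coeff=20):
    y, sr = librosa.load(path, sr=sr)
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
    log_c = np.log(C ** 2 + 1e-12)                       # log power in each CQ bin
    return dct(log_c, axis=0, norm="ortho")[:n_coeff]    # (n_coeff, frames)

# Hypothetical file; the features would feed a downstream cry classifier.
feats = cqcc("infant_cry.wav")
```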

6. Development of Children's Speech Database for ASD and Articulation Disorders:

  • As a measure to analyze language acquisition and speech communicative disorders at an early stage and to suggest appropriate therapy for ASD speakers and persons with articulatory disorders, an annotated speech database is required. 
  • Further, speech data from normal-speaking children will also be collected for comparison in Tamil, Hindi, and Indian English.
  • Children’s vocal tract systems are not developed fully (including differences in various stages of language acquisition), and the acoustic characteristics of their speech are significantly different from those of adults. 
  • Speaker characteristics vary greatly, depending on 
    • Age-group 
    • Health
    • Place of origin
    • Language as well as language proficiency. 
  • Due to high fundamental or pitch frequency in children’s speech, the vocal tract spectrum gets quasi-periodically sampled by distantly-spaced pitch source harmonics, resulting in poor spectral resolution. 
  • Since state-of-the-art ASR systems predominantly use spectral features (such as MFCC or Mel filterbank energies), the performance of children's ASR is significantly degraded (the sketch after this list illustrates the harmonic-spacing issue). 
  • Since children in the 6-10 year age group are still in the language-learning phase, their pronunciation may be inconsistent. 
  • Children also tend to stutter and pause more as compared to adults. 
  • This further degrades the performance of children's ASR.
  • A separate database, if collected and annotated, may help in building assistive devices for children with speech disorders. 
  • To collect children’s speech databases, establishing contact with children’s hospitals will be the initial task, followed by text formulation, obtaining consent from parents/guardians for data collection, and speech data collection.
  • Speech data collection from children with ASD, as with any other disordered speech data collection, needs ethical clearance and consent from their parents or guardians. 
  • Because ASR techniques that perform well on adult speech perform poorly on children's speech, many speech enhancement and data augmentation methods have been explored. 
  • A few researchers have tried to expand the pronunciation dictionary to offset the effects of mispronounced words in children’s speech. 
  • Some self-supervised approaches have also been adapted for the task of children's ASR.
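
The sketch below, assuming librosa and a hypothetical recording name, illustrates the harmonic-spacing issue: it estimates the child's median F0 with pYIN and reports how few harmonics are available to sample the vocal-tract envelope that mel filterbank features try to capture.

```python
# A minimal sketch: high child F0 means harmonics spaced far apart, so the
# vocal-tract spectral envelope is sampled coarsely by the source harmonics.
import numpy as np
import librosa

y, sr = librosa.load("child_utterance.wav", sr=16000)        # hypothetical recording
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=150, fmax=500, sr=sr)
median_f0 = np.nanmedian(f0)                                  # child F0 is often 250-400 Hz

# Mel filterbank energies: the spectral front end typical of current ASR systems.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40)

print(f"Median F0: {median_f0:.0f} Hz; only about {int((sr / 2) / median_f0)} "
      f"harmonics sample the spectral envelope up to {sr // 2} Hz")
```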

7. To Develop Speech-Based Emotion Recognition for Disordered Speech

  • The speech signal is an important way to examine the subtle difference in prosody production (e.g., stress marking, discourse structure, stylistic components of speech) in individuals with ASD. 
  • A thorough analysis of temporal and spectral characteristics of speech will be conducted for the assessment of the disorder. 
  • Further, as facial expressions and prosody in speech showing emotions are lacking in people with ASD, a thorough analysis of these characteristics will be conducted. 
  • Furthermore, the speech of people with ASD is monotonous or sometimes sing-song in nature.
  •  A thorough analysis of emotions of disordered speech is expected to provide more information on the pathological conditions of speech. 
  • The current gold standard diagnostic tools for ASD are the Autism Diagnostic Observation Schedule, 2nd Edition (ADOS-2) and the Autism Diagnostic Interview-Revised (ADI-R). 
  • However, none of these instruments is sufficient for diagnosis.
  • As per a recent study, social communication difficulties in autism spectrum disorder may involve deficits in cross-modal coordination, i.e., the dynamic relation between speech production and facial expression.
  • The adult brain responds to the acoustic characteristics of infant cries. 
  • An fMRI-based study measured brain activity during adult processing of cries of infants with ASD and of matching control infants. 
  • For ASD infant cries, in addition to higher activations in the primary and secondary auditory cortex, higher levels of activation were observed in the inferior frontal gyrus, a region known to play a critical role in processing emotional information and evaluating the affective salience of speech. 
  • This suggests that listening to the cries of infants with ASD calls for deeper and more effortful auditory attention and comprehension and, in particular, comprehension of “emotional content,” which may be compromised in the cries of infants with ASD. 
  • Thus, this original investigation indicates a strong need for understanding emotional content from the voice of ASD individuals, and we believe similar reasoning may hold for understanding the emotions of disabled persons.
  • Further, there has been significant work on speech emotion recognition; however, the emotions (such as happiness, sadness, anger, and anxiety) are often simulated (e.g., the SUSAS emotion recognition database), as opposed to the true emotions revealed by people with disability. 
  • Understanding the emotion of people with disabilities using speech is very significant as this may help speech and language therapists for therapeutic interventions. 
  • To that effect, we propose to work on speech emotion recognition for disabled people and analyze the performance against traditional speech emotion recognition, which is a mature research area in the speech literature. 
  • In particular, we plan to investigate Teager Energy Operator (TEO)-based features for emotion recognition tasks (a sketch of the operator follows).
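
For reference, the discrete Teager Energy Operator is psi[x[n]] = x[n]^2 - x[n-1]*x[n+1]. The minimal NumPy sketch below computes it and pools simple frame statistics; the pooling into mean/variance features is our illustrative assumption, since TEO-based front ends typically apply the operator per sub-band before pooling.

```python
# A minimal sketch of the discrete Teager Energy Operator (TEO),
# psi[x[n]] = x[n]^2 - x[n-1] * x[n+1], as a building block for
# emotion-recognition features.
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    # Valid for n = 1 .. len(x)-2; the two endpoint samples are dropped.
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def teo_stats(frame: np.ndarray):
    # Illustrative frame-level feature: mean and variance of TEO energy.
    e = teager_energy(frame)
    return float(e.mean()), float(e.var())
```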


Organisations

   Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar

   Sri Sivasubramaniya Nadar College of Engineering