Lehr- und Forschungsgebiet Phoniatrie, Pädaudiologie
und Kommunikationsstörungen

Bernd J. Kröger: Research

Neural model of speech processing (production, perception, acquisition)

See also: Wikipedia: Neurocomputational Speech Processing
Our model is developed within the paradigms of theoretical neuroscience and systems neuroscience

1 ) A time-explicit spiking neuron model for speech production and speech perception using the NENGO neural modeling framework (see: www.nengo.ca)

The architecture of the neural model of speech processing (see references up to 2023) including location of buffers of the cognitive and sensorimotor processing part within the brain
The architecture of the neural model of speech processing developed using the NENGO.ai-approach (see references up to 2020)
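To illustrate what "time-explicit spiking" means, the following is a minimal sketch of a single leaky integrate-and-fire neuron stepped through time in plain Python. It does not use the NENGO API and is not the code of the model described here; the function name and all parameter values (dt, tau_rc, tau_ref, v_th) are assumptions chosen for illustration.

```python
def simulate_lif(current, t_end=1.0, dt=0.001, tau_rc=0.02, tau_ref=0.002, v_th=1.0):
    """Return the spike times (s) of one leaky integrate-and-fire neuron
    driven by a constant input current (all values are illustrative)."""
    v = 0.0            # membrane potential (normalized)
    refractory = 0.0   # remaining refractory time in seconds
    spikes = []
    steps = int(round(t_end / dt))
    for i in range(steps):
        t = i * dt
        if refractory > 0.0:
            # The neuron is silent while it recovers from the last spike.
            refractory -= dt
            continue
        # Euler step of dv/dt = (current - v) / tau_rc
        v += dt * (current - v) / tau_rc
        if v >= v_th:
            spikes.append(t)      # the spike is an explicit event in time
            v = 0.0               # reset after the spike
            refractory = tau_ref
    return spikes


if __name__ == "__main__":
    for current in (0.5, 2.0, 4.0):
        print(current, len(simulate_lif(current)), "spikes in 1 s")
```

A stronger input yields a higher firing rate, and an input below the threshold yields no spikes at all. In a spiking framework, values are represented by the activity of whole populations of such neurons rather than by a single unit.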


2 ) A conventional connectionist SOM and GSOM neural model for production, perception and acquisition of speech is developed on the basis of two different articulatory-acoustic models

The architecture of the model of speech processing using classical connectionist approaches (see publications below)
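For readers unfamiliar with self-organizing maps, here is a minimal sketch of a generic one-dimensional Kohonen SOM in plain Python. It is not the implementation used in the model, it omits the growth step that distinguishes a GSOM, and the function names, map size and learning schedule are assumptions for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x
    (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((wk - xk) ** 2 for wk, xk in zip(weights[j], x)),
    )


def train_som(data, n_nodes=5, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D self-organizing map on equal-length feature vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks over time
        for x in data:
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D grid of nodes
                h = math.exp(-((j - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


if __name__ == "__main__":
    # Two toy clusters standing in for two classes of feature vectors.
    data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
    som = train_som(data)
    print(best_matching_unit(som, [0.0, 0.0]), best_matching_unit(som, [1.0, 1.0]))
```

After training, inputs from different clusters activate different nodes, which is the property such maps exploit when they associate, for example, motor and auditory states.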


Neural model for sentence processing

simple approach: parsing and production of SPO sentences using a linear parser, programmed in NENGO
here: the ipynb source code for parsing: ipynb
here: the ipynb source code for production: ipynb
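The idea of a linear parser can be sketched in a few lines of plain Python. This is not the NENGO notebook code: it assumes that SPO stands for subject-predicate-object and that roles are assigned purely by word position, and the function names are hypothetical.

```python
ROLES = ("subject", "predicate", "object")


def parse_spo(sentence):
    """Assign roles to a three-word sentence strictly by word position."""
    words = sentence.strip().rstrip(".").split()
    if len(words) != len(ROLES):
        raise ValueError("expected exactly three words: subject, predicate, object")
    return dict(zip(ROLES, words))


def produce_spo(roles):
    """Linearize a role dictionary back into a sentence in S-P-O order."""
    return " ".join(roles[r] for r in ROLES) + "."


if __name__ == "__main__":
    parsed = parse_spo("Peter eats apples.")
    print(parsed)
    print(produce_spo(parsed))
```

Because the roles depend only on position, any sentence with embedding or a different word order is rejected or misparsed, which is the limitation a nonlinear parser would have to overcome.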

next goal: programming a nonlinear parser, capable of processing complex sentences

Simulation of production mechanisms of voice, reeds and brass

Simulation models for (a) reeds, (b) brass, and (c) voice (see publications below)


Classification of voice quality

Production-Perception Approach (PPA): Stimuli are generated using a self-oscillating two-mass model (see: VocalTractLab)

Try to imitate the auditory stimuli, following the articulatory explanations (not for rough and unstable voice)
  • pressure:   low  mid  high  low2high  high2low     ... results in soft to loud voice (no or little change in voice quality)
  • tension:   low  mid  high  low2high  high2low     ... results in low to high pitch (no or little change in voice quality)
  • articulation vs. phonation:   aua(low)  aua(high)  aaa(low-high-low)  uuu(low-high-low)     ... results in changes of pitch or in changes of supralaryngeal articulation (no or little change in voice quality)
  • abduction:   low  mid  high  low2high  high2low     ... results in change of voice quality: from asthenic (hypo) to strained (hyper)
  • noisy leak:   low  mid  high  low2high  high2low     ... results in change of voice quality: from no to full breathiness
  • tissueDamping4RoughVibs:   low  mid  high  low2high  high2low     ... results in change of voice quality: from full to no roughness
  • tissueDamping4InstabVibs:   low  mid  high  low2high  high2low     ... results in change of voice quality: unstable voice because of low tissue damping
  • changeTension4InstabVibs:   low  mid  high  low2high  high2low     ... results in change of voice quality: unstable voice because of changes in vocal fold tension
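The five settings offered for each parameter above (three steady levels and two glides) can be sketched as a small trajectory generator. This is only an illustration: the normalized range 0 to 1, the frame count and the function name are assumptions, and the mapping to physical values of the two-mass model is not given here.

```python
LEVELS = {"low": 0.0, "mid": 0.5, "high": 1.0}  # assumed normalized values


def parameter_trajectory(setting, n_frames=5):
    """Return one normalized parameter value per frame for a stimulus setting."""
    if setting in LEVELS:
        # Steady setting: the parameter stays constant over the stimulus.
        return [LEVELS[setting]] * n_frames
    if setting in ("low2high", "high2low"):
        if n_frames < 2:
            raise ValueError("a glide needs at least two frames")
        ramp = [i / (n_frames - 1) for i in range(n_frames)]
        return ramp if setting == "low2high" else ramp[::-1]
    raise ValueError(f"unknown setting: {setting}")


if __name__ == "__main__":
    for setting in ("low", "mid", "high", "low2high", "high2low"):
        print(setting, parameter_trajectory(setting))
```

Varying one parameter at a time while holding the others fixed is what lets a listener attribute each perceived change (loudness, pitch, breathiness, roughness) to a single articulatory cause.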

Classification of Voice Quality using GIRBAS Scale
(following: PH Dejonckere et al. 1996, Revue de Laryngologie - Otologie - Rhinologie 117, 219-224)
(description of voice features: see also ASHA CAPE-V 2002: pdf)

A severity rating on a four-point scale (0 = "normal/no", 1 = "slight", 2 = "moderate", 3 = "severe") needs to be set for each of the following six features:
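The six features themselves are not reproduced in this text. As a sketch, assuming they are the ones spelled out by the GIRBAS acronym (grade, instability, roughness, breathiness, asthenia, strain), a rating could be stored and validated as follows. The function names and the compact output format are hypothetical.

```python
# Feature names assumed from the GIRBAS acronym, not taken from this page.
GIRBAS_FEATURES = ("grade", "instability", "roughness",
                   "breathiness", "asthenia", "strain")
SEVERITY = {0: "normal/no", 1: "slight", 2: "moderate", 3: "severe"}


def make_rating(**scores):
    """Validate and return a complete GIRBAS rating as an ordered dict."""
    missing = [f for f in GIRBAS_FEATURES if f not in scores]
    unknown = [f for f in scores if f not in GIRBAS_FEATURES]
    if missing or unknown:
        raise ValueError(f"missing features: {missing}, unknown features: {unknown}")
    for feature, score in scores.items():
        if not isinstance(score, int) or score not in SEVERITY:
            raise ValueError(f"{feature}: severity must be 0, 1, 2 or 3")
    return {f: scores[f] for f in GIRBAS_FEATURES}


def format_rating(rating):
    """Compact notation: first letter of each feature followed by its score."""
    return " ".join(f"{f[0].upper()}{rating[f]}" for f in GIRBAS_FEATURES)


if __name__ == "__main__":
    rating = make_rating(grade=2, instability=0, roughness=1,
                         breathiness=2, asthenia=0, strain=1)
    print(format_rating(rating))
    print({f: SEVERITY[s] for f, s in rating.items()})
```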


Speech acquisition: mental syllabary

Generation of mono-syllables (and bisyllabic words) of Standard German using an action-based articulatory production model (using: VocalTractLab) and speaker JD3.speaker
The rows of the table are ordered with respect to type of syllable (V, CV, CCV, ...) and type of vowel, consonant and consonant cluster (01, 02, 03, ...)
Columns 1 and 2 of the table reflect the type of syllable; column 3: a typical gesture score; column 4: examples
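The syllable-type labels used to order the rows (V, CV, CCV, CVC) can be derived mechanically from a transcription. The sketch below assumes that the examples are written in a SAMPA-like notation in which ":" marks length and "6" stands for vocalic /r/; the symbol set covers only characters that occur in the examples, and the function name is hypothetical.

```python
# Vowel symbols as they occur in the examples (assumed SAMPA-like notation);
# "6" is treated as vocalic /r/ and therefore as part of the nucleus.
VOWEL_CHARS = set("aeiouyEIOUY2@6")
LENGTH_MARK = ":"


def syllable_structure(transcription):
    """Return the C/V pattern of a transcription, e.g. 'bla:' -> 'CCV'.

    Adjacent vowel symbols (diphthongs, triphthongs) count as one nucleus.
    """
    pattern = []
    for ch in transcription:
        if ch == LENGTH_MARK:
            continue  # length does not change the syllable type
        kind = "V" if ch in VOWEL_CHARS else "C"
        if kind == "V" and pattern and pattern[-1] == "V":
            continue  # still inside the same vowel nucleus
        pattern.append(kind)
    return "".join(pattern)


if __name__ == "__main__":
    for item in ("a:", "aI6", "ba:", "bla:", "ha:n", "ga:b@"):
        print(item, syllable_structure(item))
```

Bisyllabic words simply produce a longer pattern (for example CVCV), so the same function separates the "words" examples from the monosyllables.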

[V] 01: V = long vowel, diphthong (incl. vowel plus vocalic /r/),
triphthong (= diphthong plus vocalic /r/)
well-timed speech actions: vocalic, glottal abduction, lung pressure action;
two actions for intonation were added;
the utterance ends with the last lung pressure action
long: i: e: E: a: o: u: y: 2: @:
diphthong: aI aU OI
long+voc/r/: i:6 e:6 E:6 a:6 o:6 u:6 y:6 2:6
triphthong: aI6 aU6 OI6
articulation: V01_ges.zip
[V] 02: from V = long to V = short vowel and reduced vowel:
compared to V01: shorten vowel;
end of utterance is determined by last lung pressure action (0Pa); function of this action is like consonantal closure in syllable offset!
short and reduced vowels only occur in CV context (see below; figure shows [d@])
audios for variation of V in /CV/: see above and below;
V = long, short, reduced
[CV] 01: from V to CV with C = voiced plosive, lateral, or /r/ as fricative realization:
compared to V01/V02: in addition: consonantal obstruction action
long: ba: da: ga: la: ra: bu: du: gu: lu: ru: bi: di: gi: li: ri:
diphthong: baI baU daU daI gaU gaI gOI laU raU rOI raI
long+voc/r/: bi:6 bE:6 di:6 dE:6 gi:6 le:6 ru:6
triphthong: baU6 baI6 raI6
short: ba da ga la ra
reduced: b@ d@ g@ l@ r@
words: ga:b@ ga:d@ la:g@ ba:r@ gal@ gar@
articulation: CV01_ges.zip
[CV] 02: from C = lateral to C = nasal:
compared to CV01: add a velopharyngeal opening action
long: ma: na: mi: ni: mu: nu:
diphthong: maU naU maI naI mOI nOI
long+voc/r/: mi:6 me:6 mE:6 mo:6 nE:6 na:6 nu:6
triphthong: maU6 maI6
short: ma na
reduced: m@ n@
words: da:m@ pan@
articulation: CV02_ges.zip
[CV] 03: from C = voiced to C = voiceless plosive:
compared to CV01: shift of consonantal obstruction action to the left in order to allow VOT;
add a glottal abduction action (longer duration of initial glottal opening action)
long: pa: ta: ka: pi: ti: ki: pu: tu: ku:
diphthong: paI paU pOI taI taU tOI kaI kaU kOI
long+voc/r/: pi:6 pu:6 ti:6 ty:6 te:6 to:6 tu:6 ke:6 ko:6 ku:6
triphthong: paU6 taU6 tOI6
short: pa ta ka
reduced: p@ t@ k@
words: kap@ pa:t@ bak@
articulation: CV03_ges.zip
[CV] 04: from C = voiceless plosive to C = voiceless fricative:
compared to CV03: longer duration of glottal opening action over the whole consonantal obstruction;
slight right shift of consonantal obstruction action (more temporal overlap with vocalic action)
long: fa: sa: Sa: ci: xa:
diphthong: faU faI SaU SaI SOI
long+voc/r/: fi:r fy:r fE:r fo:r fu:r Si:r Sy:r SE:r Sa:r So:r Su:r
triphthong: faIr fOIr SaUr
short: fa sa Sa ca xa
reduced: f@ s@ S@ c@ x@
words: ?af@ kas@ laS@ kYc@ la:x@
articulation: CV04_ges.zip
[CV] 05: from C = voiceless to C = voiced fricative:
compared to CV04: add a glottal adduction action (here called "breathy"),
leading to phonation and a leak in order to generate enough air flow
for frication noise during production of voiced fricative
long: va: za: Za: ja:
diphthong: vaU vaI zaU zaI jaU
triphthong: vaIr zaUr
short: va za Za ja
reduced: v@ z@ Z@ j@
words: m2:v@ va:z@ ga:z@ ko:j@
articulation: CV05_ges.zip
[CV] 06: from V to CV with C = glottal stop /?/:
compared to V01/V02: add a glottal stop action (strong adduction);
slightly later onset of lung pressure action (1000Pa)
long: ?i: ?a: ?u: ?y:
diphthong: ?aI ?OI ?aU
short/reduced: ?I ?a ?U ?Y ?@
articulation: CV06_ges.zip
[CV] 07: from /?V/ to /hV/ with /h/ = glottal voiceless fricative:
compared to CV04 (oral voiceless fricative): delete the oral consonantal obstruction action;
just: long glottal abduction action (voiceless) plus synchronous lung pressure action (1000Pa)
V=long: hi: ha: hu: hy:
diphthong: haI hOI haU
V=short/reduced: hI ha hU hY h@
words: ?e:(h)@ my:(h)@
articulation: CV07_ges.zip
[CCV] 01: from CV (CV01-CV03) to CCV01
with C1 = plosives and C2 = lateral, nasal, or /r/ as fricative realization:
mainly add one consonantal obstruction action;
other changes of velic, glottal, F0 and lung actions are similar to those in CV01-CV03
V=[a:]: bla: bra: gla: gna: gra: pla: pra: kla: kna: kra:
V=[i:]: bli: bri: gli: gni: gri: pli: pri: kli: kni: kri:
V=[u:]: blu: bru: glu: gnu: gru: plu: klu: knu: kru:
articulation: CCV01_ges.zip
[CCV] 02: from CCV01 to CCV02 with C1 = voiceless fricative or C1C2 are both voiceless:
changes of actions similar to CV01-CV05
V=[a:]: fla: Sla: Sna: pfa: Spa: Sta:
V=[i:]: fli: Sli: Sni: pfi: Spi: Sti:
V=[u:]: flu: Slu: Snu: pfu: Spu: Stu:
articulation: CCV02_ges.zip
[CVC] 01: from CV01...CV09 to CVC01 with C(final) = nasal or lateral:
overlap of consonantal closure action with last part of preceding vowel
and co-occurring velopharyngeal opening action in case of nasal
from CV01: ba:n da:m ga:n la:m ga:b@n ga:r@n gal@n ban bal dam dIl gan lam ran da:m@n pan@n m2:b@l
from CV02: mo:n ma:l man mOl na:m nIm
from CV03: pa:n pa:l kan kal ka:m tIm tEl
from CV04: faIn fal Sal SaUm ?af@n vaf@l lax@n vax@n vaf@n la:x@n ?a:x@n
from CV05: va:m va:n va:l van val vIl vOl jan m2:v@n va:z@n ko:j@n
from CV07: ?i:m ?Im ?i:n ?In ?a:m ?a:l ?am ?a:n ?an ?al
from CV08: ha:n hu:n ho:l ham hIm han hal hIn ?e:(h)@n my:(h)@n
articulation: CVC01_ges.zip
[CVC] 02: from CV01...CV09 to CVC02 with C(final) = voiceless plosive:
overlap of consonantal closure action with last part of preceding vowel
and co-occurring glottal opening action
from CV01: ba:t ba:k bak bu:k bu:p lo:t lo:p ra:t rUk
from CV02: ma:t mat na:t nEp
from CV03: pak kap pat
from CV04: fIt SIk SOk
from CV05: za:t zak zat vat jUp jUk
from CV07: ?a:t ?Et ?Ek ?ap
from CV08: ha:t hu:t hat hIk hak
articulation: CVC02_ges.zip
[CVC] 03: from CV01...CV09 to CVC03 with C(final) = voiceless fricative:
overlap of consonantal obstruction action with last part of preceding vowel
and co-occurring glottal opening action
from CV01: ga:s la:s laUs laIc das bas dax dOx lax rax laS raS ri:f raIf rIf dIc daIc rIc raIc lIc
from CV02: ma:s nas naS na:x mi:f mUf mUs mIc
from CV03: pas paS kES kEs tax tIS tu:x taIc ki:s
from CV04: fu:s fi:s fas fIS fES fUS Sax SaIc Si:f SIf
from CV05: vaIc vas vaS vax vIS vu:S zax zIc
from CV07: ?a:s ?aUf ?aIs ?aUs ?as ?ax ?aS ?uf ?Ox
from CV08: has haf huX haS haIs haUs hOIs hi:s
articulation: CVC03_ges.zip
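The table above builds each syllable type from a simpler one by adding or modifying speech actions. That incremental scheme can be sketched with a small data structure. This is an illustration only: it is not the VocalTractLab gesture score format, and the tier names, labels and all timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tier: str         # e.g. "vocalic", "consonantal", "velic", "glottal", "lung"
    label: str        # free-text description of the action
    onset_ms: float   # hypothetical timing, for illustration only
    offset_ms: float


def derive(base, *added):
    """Return a new gesture score: the base score plus extra actions,
    sorted by onset time. The base score is left unchanged."""
    return sorted([*base, *added], key=lambda a: (a.onset_ms, a.tier))


# V01: vocalic, glottal abduction and lung pressure actions.
V01 = [
    Action("vocalic", "a:", 0, 400),
    Action("lung", "lung pressure", 0, 400),
    Action("glottal", "abduction", 350, 500),
]

# CV01: compared to V01, add a consonantal obstruction action (e.g. [ba:]).
CV01 = derive(V01, Action("consonantal", "labial closure", 0, 120))

# CV02: compared to CV01, add a velopharyngeal opening action (e.g. [ma:]).
CV02 = derive(CV01, Action("velic", "velopharyngeal opening", 0, 120))

if __name__ == "__main__":
    for name, score in (("V01", V01), ("CV01", CV01), ("CV02", CV02)):
        print(name, [a.tier for a in score])
```

Keeping each derived score as "base plus added actions" mirrors the way the rows are described relative to one another (compared to V01, compared to CV01, and so on).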

Articulatory-acoustic synthesis of speech and singing

1 ) A fast 2D-articulatory-acoustic speech synthesizer using the vocal tract action control concept (Kröger et al. 2010, Cognitive Processing 11: 187-205, pdf) and based on our earlier geometrical model (Kröger et al. 2005, ZASPiL 40: 79-94, pdf) has been reimplemented in Python. We are currently integrating this fast speech synthesizer into a neuroscience-based model of speech learning.
An older version of the synthesizer was based on a simple Köln articulatory model but used the same acoustic model (see: Kröger et al. 1993, Journal Phonetica; Kröger 1998, Habil.-Thesis).

Video examples:

Audio examples:

2 ) A 3D-articulatory-acoustic synthesizer including a gestural control concept has been developed for high quality synthesis of speech and singing.
Currently the model is capable of synthesizing unrestricted text including all sound types (vowels, plosives, fricatives, ...) and unrestricted songs for untrained male and female voices. (Birkholz & Kröger 2007: Abstracts of PEVOC, Groningen, Poster).



Apraxia of Speech (AOS) and Childhood Apraxia of Speech (CAS)

see pdf of an introductory lecture: AOS: The poorly understood speech disorder
Papers on motor planning by using speech action units started in 2010 (see publications)


Treatment of speech disorders using SpeechTrainer

SpeechTrainer is a software package for 2D visualisation of speech movements (download SpeTra). SpeechTrainer can be used as a visual stimulation technique in the treatment of different types of speech disorders (see Funk_2006, Kröger_2005).


Acoustic and perceptual methods in diagnosis of speech disorders

Phonetically oriented methods in the diagnosis of speech disorders are mainly perceptually based. The main drawback of these methods is their subjectivity, so acoustically based methods could be advantageous. The main problem of acoustically based methods, however, is extracting meaningful or significant acoustic parameters.

Different phonetically oriented measures were tested for improving or refining the diagnosis of speech disorders: