Oxford Handbook of Voice Perception


Voice perception is now a lively area of research, studied from many different perspectives: from basic research on the acoustic analysis of vocalizations and their neural and cognitive mechanisms, through comparative research across ages, species, and cultures, to applied research on machine-based generation and decoding of voices, telecommunication, psychiatry, and neurology.

The Oxford Handbook of Voice Perception, edited by Sascha Frühholz and Pascal Belin, provides a comprehensive and authoritative overview of all the major research fields related to voice perception, in an accessible form, for a broad readership of students, scholars, and researchers. The handbook is composed of 40 chapters divided into seven major parts, each of which deals with a central perspective on voice perception, and it is accompanied by hundreds of sound and video examples that illustrate these perspectives. Sound and video examples can be found below:



  1. The Science of Voice Perception – Sascha Frühholz and Pascal Belin

1.1 ‘Buzzing’ produced by periodic oscillations of the vocal folds in the larynx.

1.2 Different phonemes in human speech.

  2. Ancient of Days: The Vocal Pattern as Primordial Big Bang of Communication – Diana Van Lancker Sidtis

2.1 Frogs, Leipzig, Germany.

2.2 Sarcastic/sincere utterances: 2-2a. 2-2a. American English; 2-2b. 2-2b. Korean.

2.3 Word accent contrasts: (a) American English: 2-3a. noun, 2-3a. noun phrase (green house, yellow jacket); 2-3a. noun, 2-3a. verb (import, convict); (b) Swedish: 2-3b. duck, spirit.

2.4 Ditropic utterances: (a) American English: 2-4a. idiomatic, literal; (b) Korean: 2-4b. idiomatic, 2-4b. literal; (c) French: 2-4c. idiomatic, 2-4c. literal.

2.5 Sentence focus.

  3. The ‘Vocal Brain’: Core and Extended Cerebral Networks for Voice Processing – Pascal Belin

3.1 Voice without speech.

3.2 Example of vocal block in the voice localizer.

3.3 Example of non-vocal block in the voice localizer.

3.4 Male voice average.

3.5 Female voice average.

  5. Acoustic Properties of Infant-Directed Speech – Yuanyuan Wang, Derek M. Houston, and Amanda Seidl

5.1 Infant-directed speech, showing prosodic modifications: (a) 5.1a English, 3 months old; (b) 5.1b English, 9 months old; (c) 5.1c English, 12 months old; (d) 5.1d English, 15 months old; (e) 5.1e English, 18 months old; (f) 5.1f Mandarin, 21 months old.

5.2 Adult-directed speech, showing prosodic modifications: (a) 5.2a English, 3 months old; (b) 5.2b English, 9 months old; (c) 5.2c English, 12 months old; (d) 5.2d English, 15 months old; (e) 5.2e English, 18 months old; (f) 5.2f Mandarin, 21 months old.

  6. The Singing Voice – Johan Sundberg

6.1 Example of the voice source of a real voice, derived by eliminating the effects of the vocal tract on the sound.

6.2 Stimuli that received the highest and the lowest mean ratings of pressedness in the study by Millgård et al. (2015).

6.3 Typical example of a register break in a male voice.

6.4 Synthesized tone, first presented without, and then with, vibrato.

6.5 Synthesized vowel with vibrato rates of 7, 6, and 5 Hz, presented twice, first with an extent of ±0.5 semitone, then with an extent of ±1 semitone.

6.6 Timbral effect of clustering formants so as to create a singer’s formant cluster in a synthesized vowel. First, formants 3, 4, 5, and 6 are at 2500 Hz, 3700 Hz, 4900 Hz, and 5500 Hz. Then, the fourth is changed to 2700 Hz; then the fifth is changed to 2900 Hz; and finally the sixth is changed to 3500 Hz.

6.7 Examples of the Rock, Soul, Pop, and Dance Band styles of singing analysed in the study by Zangger-Borch and Sundberg (2012).

6.8 Timbral effects of successively adding formant and voice-source characteristics to a synthesized tone. Each example is presented three times: (a) synthesis of a neutral, non-twang style; (b) as (a) but first formant changed to the value found in twang; (c) as (b) but second formant changed to the value found in twang; (d) as (c) but third formant changed to the value found in twang; (e) as (d) but fourth formant changed to the value found in twang; (f) as (e) but fifth formant changed to the value found in twang.

  8. Reconsidering the Nature of Voice – Jody Kreiman and Bruce R. Gerratt

8.1 Six synthetic voices created with Kreiman et al.’s psychoacoustic model, along with the corresponding original, natural voice samples: female 1 (a) 8.1a original (b) 8.1b synthetic; female 2 (c) 8.1c original (d) 8.1d synthetic; female 3 (e) 8.1e original (f) 8.1f synthetic; male 1 (g) 8.1g original (h) 8.1h synthetic; male 2 (i) 8.1i original (j) 8.1j synthetic; male 3 (k) 8.1k original (l) 8.1l synthetic.



  10. One Step Beyond: Musical Expertise and Word Learning – Stefan Elmer, Eva Dittinger, and Mireille Besson

10.1 Four consonant-vowel (CV) syllable stimuli: (a) 10.1a /ka/ natural German CV syllable; (b) 10.1b /da/ natural German CV syllable; (c) 10.1c /wnka/ reduced-spectrum analogue of the German /ka/ syllable; (d) 10.1d /wnda/ reduced-spectrum analogue of the German /da/ syllable.

  12. Neural Responses to Infant Vocalizations in Adult Listeners – Katherine S. Young, Christine E. Parsons, Alan Stein, Peter Vuust, Michelle G. Craske, and Morten L. Kringelbach

12.1 Infant and adult vocalizations: (a) 12.1a continuous cry burst from a healthy 6-month-old infant; (b) 12.1b cry sounds from a female adult (approx. 25–30 years of age); (c) 12.1c burst of laughter from a healthy 6-month-old infant.



  13. Comparative Perspectives on Communication in Human and Non-Human Primates: Grounding Meaning in Broadly Conserved Processes of Voice Production, Perception, Affect, and Cognition – Alan K.S. Nielsen and Drew Rendall

13.1 Synthetic human speech stimuli used in experiments testing the classic bouba-kiki effect: (a) 13.1a beebee-kookah; (b) 13.1b boobah-keekee; (c) 13.1c leemay-tahkoo.

13.2 Typical primate vocalizations given in either a context of high arousal and excitement or in a calm, affiliative context: (a)–(c) 13.2a 13.2b 13.2c primate barks; (d)–(f) 13.2d 13.2e 13.2f primate coos; (g)–(i) 13.2g 13.2h 13.2i primate grunts; (j)–(l) 13.2j 13.2k 13.2l primate screams.

  14. Linking Vocal Learning to Social Reward in the Brain: Proposed Neural Mechanisms of Socially Guided Song Learning – Samantha Carouso-Peck and Michael H. Goldstein

14.1 Subsong recorded from a juvenile male zebra finch, shortly after sensorimotor phase onset at 40 dph. 

14.2 Plastic song recorded from same individual as in Audio 14.1, at the end of the sensory phase at 60 dph. 

14.3 Crystallized song recorded from same individual as in Audio 14.1, at adulthood at 90 dph.

  15. Voice-Sensitive Regions, Neurons, and Multisensory Pathways in the Primate Brain – Catherine Perrodin and Christopher I. Petkov

15.1 VIDEO Rhesus macaque monkey producing a ‘coo’ vocalization. Video 15.1
15.2 VIDEO Rhesus macaque monkey (same individual as in Video 15.1) producing a ‘grunt’ vocalization. Video 15.2

  16. Voice Perception across Species – Attila Andics and Tamás Faragó

16.1 Sounds from different mammal species, varying in valence and arousal: wild boar (a) 16.1a positive state (b) 16.1b negative state; tree shrew (c) 16.1c positive state (d) 16.1d negative state; Przewalski’s horse (e) 16.1e positive state, low pitch (f) 16.1f positive state, high pitch (g) 16.1g negative state, low pitch (h) 16.1h negative state, high pitch; pig (i) 16.1i positive state (j) 16.1j negative state; human (k) 16.1k positive state (l) 16.1l negative state; domestic horse (m) 16.1m positive state (n) 16.1n negative state; goat (o) 16.1o positive state, high pitch (p) 16.1p negative state, high pitch; dog (q) 16.1q positive state (r) 16.1r negative state; chimpanzee (s) 16.1s positive state, laughter (t) 16.1t negative state, scream.

16.2 Sounds from different bird species, varying in valence and arousal: raven (a) 16.2a positive state (b) 16.2b negative state; kea (c) 16.2c positive state (d) 16.2d negative state; ibis (e) 16.2e positive state, greeting (f) 16.2f negative state, threat. 

16.3 Sounds from different mammal species, all negative state, varying in intensity: wild boar (a) 16.3a low pitch (b) 16.3b high pitch; tree shrew (c) 16.3c low-pitch squeak, close distance (d) 16.3d low-pitch squeak, long distance; pig (e) 16.3e low pitch (f) 16.3f high pitch; kitten (g) 16.3g low pitch, isolated state (h) 16.3h low pitch, handling state; horse (i) 16.3i low pitch (j) 16.3j high pitch; cattle (k) 16.3k low pitch (l) 16.3l high pitch; silver fox (m) 16.3m low pitch (n) 16.3n high pitch.

  17. Emotional and Social Communication in Non-Human Animals – Charles T. Snowdon

17.1 Tamarin mobbing call: (a) 17.1a at normal speed; (b) 17.1b slowed to human auditory range. Note the noisy dissonant structure. 

17.2 Tamarin confident threat: (a) 17.2a at normal speed; (b) 17.2b slowed to human auditory range. Note the harmonic structure of the call. 

17.3 Chorused tamarin long call, used to bring group members together. Note the long harmonic notes and antiphonal calling. 

17.4 Calming music for tamarins. (Copyright: David Teie) 

17.5 Arousing music for tamarins. (Copyright: David Teie) 

  18. Dual Stream Models of Auditory Vocal Communication – Josef P. Rauschecker

18.1 Vocalizations of rhesus macaques (to accompany Figure 18.1A): (a) 18.1a archscream; (b) 18.1b coo; (c) 18.1c growl; (d) 18.1d tonal scream; (e) 18.1e harmonic arch; (f) 18.1f bark. (Original description of species-specific vocalizations in the macaque by Hauser, 1996.)

18.2 Band-passed noise bursts (to accompany Figure 18.2A): (a) 18.2a pure-tone burst; (b) 18.2b 0.333 octaves; (c) 18.2c 0.5 octaves; (d) 18.2d 1 octave; (e) 18.2e 2 octaves; (f) 18.2f white-noise burst.



  20. The Electrophysiology and Time Course of Processing Vocal Emotion Expressions – Silke Paulmann and Sonja A. Kotz

20.1 Successively building fragments of a pseudo-sentence conveying anger, as used in a typical emotional prosody gating study, together with the full sentence: (a) 20.1a gate 1; (b) 20.1b gate 2; (c) 20.1c gate 3; (d) 20.1d gate 4; (e) 20.1e gate 5; (f) 20.1f gate 6; (g) 20.1g full sentence.

20.2 Successively building fragments of a pseudo-sentence conveying fear, as used in a typical emotional prosody gating study, together with the full sentence: (a) 20.2a gate 1; (b) 20.2b gate 2; (c) 20.2c gate 3; (d) 20.2d gate 4; (e) 20.2e gate 5; (f) 20.2f gate 6; (g) 20.2g full sentence.

20.3 Successively building fragments of a pseudo-sentence conveying happiness, as used in a typical emotional prosody gating study, together with the full sentence: (a) 20.3a gate 1; (b) 20.3b gate 2; (c) 20.3c gate 3; (d) 20.3d gate 4; (e) 20.3e gate 5; (f) 20.3f gate 6; (g) 20.3g full sentence.

  21. Amygdala Processing of Vocal Emotions – Jocelyne C. Whitehead and Jorge L. Armony

21.1 Non-linguistic vocalizations: (a) 21.1a sadness, male; (b) 21.1b sadness, female; (c) 21.1c pleasure, male; (d) 21.1d pleasure, female; (e) 21.1e neutral, male; (f) 21.1f neutral, female; (g) 21.1g happiness, male; (h) 21.1h happiness, female; (i) 21.1i fear, male; (j) 21.1j fear, female.

21.2 Pseudospeech: (a) 21.2a surprise, male; (b) 21.2b surprise, female; (c) 21.2c sad, male; (d) 21.2d sad, female; (e) 21.2e neutral, male; (f) 21.2f neutral, female; (g) 21.2g happy, male; (h) 21.2h happy, female; (i) 21.2i fear, male; (j) 21.2j fear, female; (k) 21.2k disgust, male; (l) 21.2l disgust, female; (m) 21.2m anger, male; (n) 21.2n anger, female.

  22. Laughing Out Loud! Investigations on Different Types of Laughter – Kai Alter and Dirk Wildgruber

22.1 Different types of laughter: (a) 22.1a tickling; (b) 22.1b taunting; (c) 22.1c sniffing; (d) 22.1d schadenfreude; (e) 22.1e through nasal cavity; (f) 22.1f nasal cavity open; (g) 22.1g laughter phrases; (h) 22.1h joyful/friendly; (i) 22.1i inhalation and voiced; (j) 22.1j inhalation and exhalation; (k) 22.1k fricative; (l) 22.1l cough-like.



  24. Perceiving Speaker Identity from the Voice – Stefan R. Schweinberger and Romi Zäske

24.1 Parameter-specific voice morphing: examples of adaptor types used in Skuk et al. (2015). Adaptor types differ with respect to the acoustic parameters that have been morphed along a male–female gender continuum: (a) 24.1a F0 adaptor type with male/androgynous/female F0 contour and with androgynous timbre; (b) 24.1b timbre adaptor type with male/androgynous/female timbre and with androgynous F0 contour; (c) 24.1c full adaptor type with male/androgynous/female F0 contour and timbre; (d) 24.1d voice gender continuum of full morphs (F0 and timbre) spanning from male to female in steps of 10%.

24.2 Voice samples of male and female, young and old speakers uttering the German sentence ‘Die Nachfrage bestimmt den Preis’ [‘Demand determines the price’]. Stimuli for young speakers are taken from Zäske et al. (2014), and all samples are taken from the Jena Speaker Set Database: (a) 24.2a male, 75 years old; (b) 24.2b male, 68 years old; (c) 24.2c male, 65 years old; (d) 24.2d male, 25 years old; (e) 24.2e male, 23 years old; (f) 24.2f male, 21 years old; (g) 24.2g female, 71 years old; (h) 24.2h female, 65 years old; (i) 24.2i female, 64 years old; (j) 24.2j female, 23 years old; (k) 24.2k female, 19 years old; (l) 24.2l female, 18 years old.

24.1 VIDEO Stimuli taken from Schweinberger, Kloth, and Robertson (2011) show two identical videos of a male speaker onto which either his own voice (a) video 24.1a or another speaker’s voice (b) video 24.1b was dubbed. The audio and video tracks were initially time-standardized with respect to word onsets and overall duration such that the voice of speaker B is in time synchrony with speaker A’s facial movements.

  25. Perceptual Correlates and Cerebral Representation of Voices: Identity, Gender, and Age – Marianne Latinus and Romi Zäske

25.1 Acoustic cues for identity perception: 25.1a 25.1b 25.1c 25.1d four different, natural, female voices uttering the same syllable; 25.1e female voice prototype, corresponding to the average of thirty-two voices.

  26. The Perception of Personality Traits from Voices – Phil McAleer and Pascal Belin

26.1 A trustworthy sounding male voice demonstrates a slightly higher pitch than average: compare (a) 26.1a trustworthy male, with (b) 26.1b untrustworthy male. For females, trustworthiness seems to relate to the glide of the voice: compare (c) 26.1c rising intonation of untrustworthy female voice, with (d) 26.1d dropping intonation of trustworthy female voice.
26.2 Dominance is strongly influenced by formant dispersion, with lower dispersion associated with higher dominance: compare (a) 26.2a male, dominant, with (b) 26.2b male, non-dominant; and (c) 26.2c female, dominant, with (d) 26.2d female, non-dominant.
26.3 (a) 26.3a male trustworthiness (PC1) continuum, from low to high; (b) 26.3b female trustworthiness (PC1) continuum, from low to high; (c) 26.3c male dominance (PC2) continuum, from low to high; (d) 26.3d female dominance (PC2) continuum, from low to high.

  29. Voices in the Context of Human Faces and Bodies – Benjamin Kreifelts and Thomas Ethofer

29.1 VIDEO Stimuli taken from Kreifelts et al. (2007) exemplifying different multimodal (voice/face) non-verbal emotional expressions: (a) 29.1a angry non-verbal expression, female speaker; (b) 29.1b neutral non-verbal expression, male speaker; (c) 29.1c happy non-verbal expression, male speaker. The videos depict actors speaking single words with an emotional expression, using an autoinduction technique. The actors are wearing headcaps obscuring all non-face parts of the speaker. For further details see Kreifelts et al. (2007).

  30. Linguistic ‘First Impressions’: Accents as a Cue to Person Perception – Patricia E.G. Bestelmeyer

30.1 Three sample accents used in the neuroimaging study by Bestelmeyer et al. (2015): Scottish accent; Southern English accent; and General American accent.



  31. Voice Morphing – Hideki Kawahara and Verena G. Skuk

31.1 31.1a 31.1b 31.1c 31.1d 31.1e Morphed voices of the German sentence ‘Keine Antwort ist auch eine Antwort’ [‘No answer is an answer as well’] resulting from the interpolation of a 22-year-old male speaker with a 71-year-old male speaker.

31.2 Parameter morph continua between a male and a female /aba/: (a) 31.2a full morph continuum; (b) 31.2b F0 morph continuum; and (c) 31.2c timbre morph continuum.

31.3 Morphs resulting from interpolations and extrapolations of a 25-year-old male speaker with an average voice. The morph continuum ranges from the anti-voice, with an identity strength of −1, to a caricature with an identity strength of 1.5.


  35. Voice and Speech Synthesis: Highlighting the Control of Prosody – Keikichi Hirose

35.1 Japanese TTS conversion; formant synthesizer with control of prosodic features using generation process model.
35.2 Japanese emotional speech by HMM-based speech synthesizer with prosodic features generated using generation process model.
35.3 Japanese synthetic speech with emphasis controlled by generation process model. Sentence: ‘arayuru geNjitsuo subete jibuNnohooe nejimagetanoda’ [(He) twisted all the realities to his side].
35.4 Male to female voice conversion with prosodic features controlled by generation process model. Sentence: ‘chiisana unagiyani nekkino yoonamonoga minagiru’ [A small eel shop is filled with a kind of hot air].
35.5 HMM-based speech synthesis with prosodic features controlled using generation process model-based F0 contours. Sentence: ‘hyoogeNsuru nooryokuo minitsukeru kotodearu’ [It is to obtain an ability of expressing].
35.6 HMM-based speech synthesis with prosodic features controlled using generation process model-based F0 contours. Sentence: ‘teNimuhoo oorakanamonoda’ [(He) is such flawless, natural and generous (person)].