Text-to-speech and the Uncanny Valley

I like text-to-speech. I listen to it for hours a week. I wouldn’t buy an EBR (E-Book Reader) that didn’t have it.

Many people don’t like it.

Many people really don’t like it.

In fact, many people hate it.

Hate it, hate it, hate it.

But why?

I think the reason behind it is fascinating.

First, it’s not simply because it is imperfect. If that were the case, people would hate the 16-level grayscale images. Some people certainly find them inadequate, and wish they had color. They don’t hate them, though…they don’t find them repulsive.

I don’t think I’ve heard “repulsive”, but I have certainly heard “creepy” many times in association with text-to-speech.

You know what the problem is?

The voice is too good…and still not good enough.

It’s a well-known problem in robotics and animation. If a robot looks like something that doesn’t actually exist, like a purple dragon, it doesn’t have to be accurate.

If it looks kinda sorta like a human, but clearly isn’t, that’s okay.

If it looks almost exactly like a human, but has something slightly wrong (eyes that blink too little or too much, legs that bend too far backwards), it actually becomes less attractive…people may say it is “creepy”.

Sound familiar?

Let’s say you start out with a cube. People probably won’t think it is cute. Then you add stuff to it…maybe little arms. It starts to look more human, and people like it better. Then, you start approaching human…at first, the acceptance curve is up. Then, it gets too close…but still has “defects”. Suddenly, the acceptance curve drops off…people don’t like it. You work out those problems, your robot looks fully human…and people like it again.

There is a sudden sharp dip just before it becomes fully human, and then it rises again when it becomes human.

That deep area when it is almost human, but still “wrong”? That’s called the “Uncanny Valley”.

You can see a graph of it here:

Wikipedia article

The term was coined by roboticist Masahiro Mori (in Japanese) in 1970.

There are some things that seem to be pretty universally off-putting…turning the eyes in a photograph upside-down, for example. You can get some sense of that here:

Exploratorium exhibit

Outside of those broad strokes, though, it’s a bit more personal.

Some people were absolutely creeped out by The Polar Express movie in 2004. It was done with motion capture, which probably made the general body movement more realistic than traditional animation. However, motion capture can’t give you realistic eyes…and people complained about the eyes. Unlike a regular cartoon, the motion capture made people judge it at a human level.

It’s going to depend somewhat on your expectations and mindset, I think.

For many users, the Kindle’s text-to-speech is smack dab in the middle of the Uncanny Valley.

Those people may not be bothered by a less realistic voice in a GPS. They aren’t bothered by audiobooks read by actors.

When the TTS mispronounces something…or simply “inhumanly” or “robotically” doesn’t have enough emotional variation? Valley time.

Why do some people have this reaction?

There are several hypotheses.

They generally have to do with the point at which we start judging the simulacrum by the same standards we judge humans.

An instinctively negative reaction to something that is “not quite right” could have to do with evolutionary drives. We had a dog who got along just fine with our other dogs…until she had one of her seizures. When she did, the other two dogs would actually go for her throat…we’d have to protect her. Presumably, they were driven to eliminate a “weakness from the species”.

Repulsion could also have to do with mate selection, in a similar manner. Yes, we want to reproduce…but something has to turn us off from wanting to reproduce with everyone. If someone had a negative genetic indicator, many people might be turned off by it.

The text-to-speech voices probably sound good enough (they are actually people…they have spoken the words and phrases, and then software assembles it as well as it can) that we judge them as fellow humans.

However, you might be thinking (and I hope you are), that people can find just about anybody acceptable. People outside the genetic mainstream do have people who find them attractive. Humans are interesting that way…we don’t always follow our instincts (no, this is not the time to bring up Rule 34 of the internet) 😉.

We can even change our feelings about things deliberately…or at least, some of us can. We can train ourselves to accept things.

Some people need to get used to the text-to-speech, then they accept it. They learn to “listen with an accent”. At that point, has it become perfectly human? No…I still don’t think the text-to-speech sounds exactly like a human. It may be that they learn not to expect it to sound like a human…so they no longer judge it by human standards.

What about those of us who accept it right away? Maybe we’ve had more experience with synthesized voices? Maybe we’ve already put them in a non-human category? Maybe we focus on the software…I assume that’s part of it for me. I never thought of it as a person.

Hmm…I wonder if it also helps that I was never an audiobook person? If I was used to my books sounding like an actor, it might be more of an adjustment.

Well, I hope that helps some people enjoy TTS more. I really do get a big benefit out of it, and I wish everybody did.

I’m not going to hold my breath until I turn blue, though…some of you would think that was creepy. 😉

Update: I’ve had some interesting comments on this post, and they’ve led me to add a poll:

This post by Bufo Calvin originally appeared in the I Love My Kindle blog.

22 Responses to “Text-to-speech and the Uncanny Valley”

  1. Tom Madsen Says:

    Hey Bufo, I, like you, must have been an early adopter or maybe I was used to being on hold and hearing the non-human voice enough. It never bothered me, and I kind of find some mispronounced words humorous. I recently took my K3 on a roadtrip with a buddy of mine that often listens to audio books and he pretty much couldn’t stand it. The convenience in the car for me is wonderful, listening to newspaper articles. My daughter even likes it, and enjoys listening to chapters of her Nancy Drew books. Though she expects it to count her 1/2 hr of daily reading times. 🙂

    • bufocalvin Says:

      Thanks for writing, Tom!

      Yes, before I wrote about it, I hadn’t thought that audiobook users might have been less likely to like text-to-speech…I probably should have. The mispronunciation being funny is telling. That goes back to the basis of humor: apparent danger but no real danger. You and I don’t see the mispronunciations as “dangerous”, but some might.

      Question: why would it not count as reading time?

      • Tom Madsen Says:

        I’m a mean dad…

        She’s 9 so she needs kindle in hand time, reading out loud, learning inflections. We go back and forth, and it can be kind of amusing with the characters. No way I’d give that time up. Oh and she might just start to pronounce words like the kindle 🙂

      • bufocalvin Says:

        Thanks for writing, Tom!

        That doesn’t sound mean to me. 🙂

        I love reading out loud, and my kid has done that from a young age as well.

        You can hear me reading The Happy Little Bookworm here:

        http://www.thekindlechronicles.com/2009/08/28/tkc-extra-the-happy-little-bookworm/

        I am just reading spontaneously there…no microphone tricks. 🙂

  2. Tom Semple Says:

    I like, make that love, having TTS. And actually I am bothered by audiobooks read by actors in some cases, in a way that cannot happen with TTS, simply because I may not resonate with a particular actor’s emotive reading, while TTS is ‘neutral’. (Another reason audiobooks can be frustrating for me is that after using TTS, I miss having the text to read along, or to read instead of listening, and having the ability to search and navigate easily. Perhaps one day we’ll have hybrid media that allows this.)

    I like the theory of the ‘uncanny valley’, but I think it is less subtle than that much of the time. As you suggest, it’s like when listening to an unfamiliar cadence or inflection of English: it just takes a little time for the pattern-recognition to kick in. Certain British accents, for example, are so difficult to pick up on that my wife and I turn on the movie subtitles for awhile. But by the end of the movie, we don’t need them any more. People just have to be a little patient with it, and start imagining how they can use it.

    And it is so useful. Sometimes I just want to close my tired eyes, and this is a way of reading without having to use them. And it is a way two or more people can read together without one having to read aloud.

    And the more I listen to Kindle’s TTS, the more impressed I am with it (except for the occasional blooper). I just wish Amazon would sell us some more voices, so I can hear English literature with a British inflection, Ian Rankin read with a Scottish inflection, or mix it up with an Australian reading of Faulkner.

    • bufocalvin Says:

      Thanks for writing, Tom!

      I’m with you on much of this. I don’t like audiobooks (unless I’ve already read the book), because I don’t like the actors (or other readers) interpreting the work for me. The unchanging affect is a plus for me.

      If it were simply pattern-recognition, then people would say, “I tried text-to-speech but I couldn’t understand it…oh, well.” It’s the revulsion I hear that makes me think it is the Uncanny Valley. Sure, some people just find it hard, and they could get used to it. The people who are very emotional in their rejection? That’s a different process.

      My Significant Other isn’t good with accents, although I am. I thought it was hilarious when BBC America ran a marathon of the original Life on Mars when the ABC version was being promoted. They not only ran it with subtitles, but had a disclaimer, something like, “While British accent can be amusing, they can be difficult to understand.” Now, admittedly, these weren’t just British accents…they were 1970s British police slang. 🙂

      Nuance, the company that makes Vocalizer (the TTS on the K3) and RealSpeak (the TTS on the K2 and the KDX), has many voices available.

      http://www.nuance.com/vocalizer5/flash/index.html

      They have two British English voices, and one each of Scottish English and Irish English. They have many other accents as well.

      For me, I don’t need the voice to be content appropriate, though. It would be fun to listen to different accents, but I wouldn’t choose female for female or British for British. I don’t expect the voices to sound like the characters.

      They also need the different speakers to be able to do different languages effectively…reading Spanish as though it was English doesn’t work very well. 🙂

      I think we will get these options…they may be memory hogs, though, so they might come first on the possible Amazon tablets (perhaps in August).

  3. Marian Says:

    Bufo, some people (like me) may have much less complicated reason not to like text-to-speech. English is not my first language, I don’t live in English speaking country. I can listen to English audiobook in a car without a problem. But listening to text-to-speech is hard for me, I need to concentrate much more. I don’t hear where is the end of the sentence, I don’t recognize direct speech. And of course, emotions are missing as well. etc etc.

    • bufocalvin Says:

      Thanks for writing, Marian!

      I do understand that. Do you find it difficult to hear, or repulsive? It’s the people who find it repulsive that fit the Uncanny Valley hypothesis.

      I have no doubt that familiarity with the language helps. 🙂 I have fun predicting what people are going to say on TV, and I can often complete their sentences or say the next line. If it wasn’t my first language, I’d have trouble with that.

      • Marian Says:

        Definitely not repulsive, just difficult to understand. I watch American movies or TV shows almost daily in English without a problem. I tried to listen to TTS and follow the text in the book, but I survived just one paragraph. I will give it another try with 2-3 pages as you suggested to somebody here.
        Thanks for a very interesting article.

      • bufocalvin Says:

        Thanks for writing, Marian!

        I think because of time zones, I’m hearing more from Europeans first. 🙂 It’s still only 7:30 in the morning here.

        I wonder if a person is less likely to fall into the Uncanny Valley if the accent is different from her or his own? If you generally hear American accents on TV or DVDs (or even in the movies), might that make it less likely for you to expect it to be like you?

        Hmm…intriguingly, that might make the Uncanny Valley worse for Brits if it had a British accent. There are opportunities for some interesting studies here, but I don’t think they’ll be done right away.

  4. Bruce Napier Says:

    Hi

    I think my problem is the lack of choice of voices. As a Brit, I somehow expect to hear a British voice reading it, I guess because that’s how my own sub vocal “reading aloud” sounds inside my head.

    This is odd, too; one of the best audiobooks I ever had was a set of Raymond Chandler shorts read in a very hard boiled American accent, and that was just fine. I guess it’s the Uncanny Valley phenomenon again.

    Any road up, like the Mac TTS, it would be really good to have a Brit option.

    All the best

    Bruce

    • bufocalvin Says:

      Thanks for writing, Bruce!

      That brings up an interesting point…I’ve told this story before, I think in the Amazon Kindle community, but I don’t think I’ve told it here.

      I had read a book and then my Significant Other was reading it (this was back in the days of paper, when we couldn’t read the same book at the same time) 😉.

      My SO said, “I’m having trouble reading the book, because when I hear this one character, I keep hearing Darren McGavin.”

      I said, “What do you mean?”

      “When I hear the character, I hear Darren McGavin.”

      “You HEAR the characters?”

      “You DON’T hear the characters?”

      We had quite a discussion about which one of us was crazy. 🙂

      I had the opportunity, as a trainer, to ask lots of people. The answer? I was the strange one. I found about fifteen percent of people were like me and don’t hear or see the characters when they read. It varied by the group…I was teaching computer software at the time. In an Advanced Excel class, many of them didn’t hear or see it. In an Advanced PowerPoint class, everybody did.

      I think I have some advantages (and disadvantages). When I do see a movie, they pretty much can’t have miscast it, as long as it fits the explicit description in the book. When we saw the first Harry Potter movie, my SO said Harry didn’t look right. I said that he had messy black hair, a lightning scar, and glasses. My SO said Harry’s chin was wrong. Harry’s chin wasn’t described in the book, so it couldn’t be wrong for me.

      On the other hand, I do think I may have some more problem keeping characters straight. If you “meet” a character in a book, and create a complete physical appearance to go along with the name, you may have a better association when encountering the name later.

  5. Emily Says:

    Interesting article. It wouldn’t have occurred to me that people would have a negative emotional reaction about TTS, but your article made me see why that might be the case. I tried TTS a few times, but I just couldn’t understand it. Even on the slowest setting it seems way too fast to me and I just couldn’t follow it. I tried both the male and the female voice as well. So for me it wasn’t an emotional reaction, just practical. However, your point about getting used to it is a good one and makes me want to give it a try again. Perhaps with a book I’ve already read. FYI, I don’t listen to audiobooks and English is my native language.

    • bufocalvin Says:

      Thanks for writing, Emily!

      Yes, I think listening to a book you’ve already read (and which you know well) could be a good “training experience”. Another thing to try: listen to a book while you are sight-reading it for at least a few pages.

      I listen to it on the fastest speed, but I know different people like different settings.

  6. Mickey Blue Eyes Says:

    I’ve used the TTS a couple times. Mispronouncing words is annoying. What makes TTS really annoying is mispronunciation of dates, e.g., 1945 as “one thousand, nine hundred, and forty five” instead of “nineteen forty five” as most would say it. Also the amount of hesitation at an en dash or comma doesn’t really flow with the narrative.

    Maybe it works better with novels than biographies, blog posts and the like.

  7. karin Says:

    I have used the TTS when I am on my work out walk (I usually swim for my work out). I use TTS on a book that I have already read, so I have no problem following when the words are mispronounced. What I really like about TTS is that there is no “abridged” version: I can listen to a book in its entirety, War and Peace if I want to. I personally don’t like Audio Books, although I have tried them. I much prefer to read them.

  8. Morgan Says:

    I never thought I’d try TTS again but your article has revived my interest 🙂 i don’t find it repulsive and I, like you, don’t “hear” characters or see them in my head while I am reading. It’s just that the TTS doesn’t seem to flow well. I can’t grasp as easily the sentence structure b/c it often rushes past punctuation marks- two sentences sound like one. which is very confusing to me when i’m not familiar with the book. i WISH i loved this feature as you do. i adore audio books but, I don’t know, I can’t seem to get into the TTS. SN- I wonder if the technology is better on the K3. I have a KDX.

  9. Mel Says:

    I tried it for awhile when I first got my Kindle. I hate it for fiction. The lack of inflection really bugs me — not just for emotions, but to denote conversation. I listen to a lot of audio books, and I’ve come to expect at least a slight change of voice to show that someone is speaking even if the narrator doesn’t do different accents for the characters.
    When I try to listen to text to speech on anything with conversations, I keep losing track of who is talking and even what is said aloud and what isn’t. It’s sort of like the audio version of those annoying books without quotation marks.

    I do like the text to speech feature for nonfiction. Inflection plays a much smaller role there.

  10. E. Ericson Says:

    I’ve never tried text-to-speak so can’t comment on whether I might find it creepy. The reason, though, is the same as for why I’ve never been interested in audiobooks, and why I didn’t let my parents read to me as a kid. I’m an extremely fast reader, and hearing the words spoken (even at high speed) inevitably takes longer than it would take me to read them. So for me it’s not a matter of the voice being too good or not good enough; I just dislike listening to things, period. (The exception being NPR podcasts, which I enjoy while walking, doing the dishes, putting on makeup, or other mundane activities during which reading is impractical.)

    • bufocalvin Says:

      Thanks for writing, E.!

      I appreciate you sharing that. Reading out loud (and being read to) has always been part of my life, so that’s different between you and me…whew, now people can tell us apart. 😉

  11. E.B. Says:

    I agree with E. Ericson. I don’t find it creepy or irritating, but just not as pleasant as reading the text. I sometimes use TTS to continue reading when I have to put my book away to do some cooking or cleaning. But as soon as the housework is done, I’ll go back to reading rather than listening. I don’t think a better voice or pronunciation would do much to change this.

    It’s interesting to read your response to E. Ericson, Bufo Calvin. I’ve been wondering about this in the context of your daily freebie flashes, where you state that you filter out books that don’t allow TTS. For me, whether a book allows TTS wouldn’t influence my buying decisions any more than whether or not the cover is green.

    • bufocalvin Says:

      Thanks for writing, E.B.!

      I understand that…I prefer to sight-read over listening, myself…but I prefer listening over literary deprivation. 😉

      My reason for not getting or linking to books that block text-to-speech isn’t aesthetic, but because I disagree with the decision. I think it disproportionately disadvantages the disabled. My decision not to reward the publishers by getting those books is philosophical.

      With the ending of the Amazon Associates program in California, I don’t need to worry about profiting monetarily from linking to the books, but I am disinclined to do so, because I don’t want to promote books that have it blocked.
