Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

Abstract

Streaming text-to-speech (TTS) synthesis in cascaded LLM-TTS systems still faces latency challenges because most TTS models require the full context before generation can begin. We present S5-TTS, a Streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech as soon as it receives the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and we employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves quality comparable to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.
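
To make the idea of limited lookahead concrete, here is a minimal sketch of one plausible reading of a lookahead-causal mask over text tokens. This page does not give the exact definition used in S5-TTS, so the rule below is an assumption: a token may attend to every token of past words, of its own word, and of the next k words. The function name lookahead_causal_mask and the token-to-word mapping word_ids are hypothetical and used only for illustration.

```python
from typing import List


def lookahead_causal_mask(word_ids: List[int], k: int) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not taken from the paper): token i sees every token whose
    word index is at most word_ids[i] + k, i.e. all past words, its own
    word, and k future words.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return [[wj <= wi + k for wj in word_ids] for wi in word_ids]


# Hypothetical tokenization: 4 words split into 6 subword tokens.
word_ids = [0, 0, 1, 2, 2, 3]
mask = lookahead_causal_mask(word_ids, k=1)

# Print one row per query token; "x" marks an allowed key position.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With k=0 the mask is purely causal at the word level. Under this assumed rule, larger k exposes more future text to each position, which means more words must arrive before synthesis of the current word can start.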

Model Overview


Audio Samples

Comparison of S5-TTS (Lookahead Words k=1,2,3), T5-TTS, and Ground Truth on LibriTTS Test Set (Zero-Shot)

Experimental results indicate that S5-TTS with a lookahead of two words (k=2) strikes a balance between naturalness and intelligibility.

Reference Audio | S5-TTS (k=1) | S5-TTS (k=2) | S5-TTS (k=3) | T5-TTS | Ground Truth
Reference Text: And in this last action he falls into the classic sin of Pride.
Target Text: Our baggage was immediately carried to the deck of the frigate. I rushed aboard.
Reference Text: Chris could catch the shine as lantern glow touched the silver buckles on their shoes.
Target Text: We shall say no more, but I trust you understand the responsibility you have?
Reference Text: Alexander leaned forward and warmed his hands before the blaze.
Target Text: Alexander walked about the room, looking at everything.
Reference Text: In the meantime, the uplifted hand of Hawkeye was above him.
Target Text: At last the scout spoke in English, and at once explained the embarrassment of their situation.
Reference Text: The combined bands of both the countries played the music and a fine supper was served.
Target Text: The immense blade was so heavy that it took the strength of seven Blueskins to raise it.
Reference Text: His father was the notorious Sir Jabez Gilchrist, who ruined himself on the turf.
Target Text: It was a small pyramid of black, putty like stuff, exactly like the one upon the table of the study.
Reference Text: The remainder of the sea monsters have disappeared.
Target Text: In the distance the clouds resemble great bales of cotton, piled up in picturesque disorder.
Reference Text: Their walk continued silent for the greater part, neither was quite satisfied with the other.
Target Text: Miss Rachel only glanced across her mufflers on his face.
Reference Text: The drag upon his beak and the light check upon his wings were inexplicable to him, and appalling.
Target Text: He saw the frantic fugitive dive over the ledge and disappear.
Reference Text: He looked up rather ungraciously, but motioned them to be seated.
Target Text: He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable.
Comparison of S5-TTS (Lookahead Words k=1,2,3), T5-TTS, and Ground Truth on VCTK Corpus (Zero-Shot)

Experimental results indicate that S5-TTS with a lookahead of two words (k=2) strikes a balance between naturalness and intelligibility.

Reference Audio | S5-TTS (k=1) | S5-TTS (k=2) | S5-TTS (k=3) | T5-TTS | Ground Truth
Reference Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
Target Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
Reference Text: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.
Target Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
Reference Text: Health Secretary Frank Dobson made the surprise announcement in the Commons yesterday.
Target Text: Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern.
Reference Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
Target Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.
Reference Text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain.
Target Text: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.
Reference Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.
Target Text: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.
Reference Text: The important retail business continues to deliver excellent results, Stewart said.
Target Text: The company hopes to top seven million passengers in its current financial year.
Reference Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
Target Text: These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon.
Reference Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows.
Target Text: That capital would be provided equally by the State and established commercial companies.
Reference Text: Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern.
Target Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.
Comparison of S5-TTS-D (distilled), S5-TTS, T5-TTS, and Ground Truth on LibriTTS Test Set (Zero-Shot)

Experimental results show that knowledge distillation can effectively improve the naturalness of S5-TTS, achieving quality comparable to T5-TTS.

Reference Audio | S5-TTS (k=2) | S5-TTS-D (k=2) | T5-TTS | Ground Truth
Reference Text: In the drawing room, after dinner, their nervousness and awkwardness increased.
Target Text: Mrs. Peter broke off in some confusion, and tripped out to meet her husband in the hall.
Reference Text: When he saw her coming, he stopped and wiped his brow.
Target Text: Telephone wires hum along the white roads, which always run at right angles.
Reference Text: Hilda was writing in her living room, under the light of a tall desk lamp.
Target Text: Marie opened the door and smilingly announced that dinner was served.
Reference Text: The combined bands of both the countries played the music and a fine supper was served.
Target Text: The immense blade was so heavy that it took the strength of seven Blueskins to raise it.
Reference Text: Yet the deep marks of conical teeth upon the iron pick are certainly those of the crocodile.
Target Text: I take this as my answer, and I leave the Professor to bite his lips with impatience.
Reference Text: Why, if we erect a station at the falls, it is a great economy to get it up to the city.
Target Text: A model of this station is preserved in the Deutschen Museum at Munich.
Reference Text: If these articles of consumption are costly, they are felt to be noble and honorific.
Target Text: The salient features of this development of domestic service have already been indicated.
Reference Text: Gordon went down the passage to the end of the hall, in the direction Murdoch had indicated.
Target Text: At one time, there had been a small dome and a tiny city devoted to the rocket field.
Reference Text: An appeal was made to the executive of the state, but little encouragement was returned.
Target Text: Let us observe the contrast between the beginning and the present proportions of the Church.
Reference Text: In the Little Convent there was a centenarian who came from the Abbey of Fontevrault.
Target Text: Fontaines in Burgundy is a country that is blest because it gave him birth.
Comparison of S5-TTS-D (distilled), S5-TTS, and T5-TTS on UltraChat Test Texts (Zero-Shot)

Experimental results show that knowledge distillation can effectively improve the naturalness of S5-TTS, achieving quality comparable to T5-TTS.

Reference Audio | S5-TTS (k=2) | S5-TTS-D (k=2) | T5-TTS | Ground Truth
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: Focus your attention on your breath and keep your core muscles tight.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: Ensure that the sculpture depicts their essence and captures their spirit, making it easy for anyone who sees it to identify the person.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: Christmas Eve is also a birthday party for our daughter.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: Thanks for the helpful tips on how emergency personnel can protect their hearing.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: This helped me to focus on the positives in my life and to appreciate them more.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: He puts on his backpack and trudges off, wistfully wishing things could be different for him.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: You can attend these events from the comfort of your own home and still experience different cultures.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: They kissed under the stars, and Brooke felt like she was exactly where she was meant to be.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: One of the most recognizable patterns used in Moroccan rugs is the diamond shape, which is also known as a lozenge.
N/A
Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks.
Target Text: Imagine a world where people can teleport objects with their minds.
N/A