Streaming text-to-speech synthesis in cascaded LLM-TTS systems still faces latency challenges because most TTS models require the full text context before they begin generating. We present S5-TTS, a Streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech as soon as it receives the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and we employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves quality comparable to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.
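To make the idea of a limited lookahead concrete, the following is a minimal sketch of a word-level lookahead-causal attention mask. It rests on assumptions that this page does not confirm: the function name `lookahead_causal_mask`, the per-token `word_ids` input, and the rule that a token may attend to all past tokens plus the tokens of the next k words are illustrative choices, not necessarily how S5-TTS builds its mask.

```python
from typing import List


def lookahead_causal_mask(word_ids: List[int], k: int) -> List[List[bool]]:
    """Build a boolean attention mask with a lookahead of k words.

    word_ids[i] is the index of the word that text token i belongs to
    (a hypothetical input format chosen for this sketch).
    mask[i][j] is True when token i is allowed to attend to token j,
    which here means token j belongs to a word no later than k words
    after the word of token i.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return [[wj <= wi + k for wj in word_ids] for wi in word_ids]


# Four tokens spread over three words, with a lookahead of one word.
mask = lookahead_causal_mask([0, 0, 1, 2], k=1)
for row in mask:
    # "x" marks a visible position, "." a masked one.
    print("".join("x" if allowed else "." for allowed in row))
```

With k=0 this reduces to a word-level causal mask, and a sufficiently large k recovers full-context attention, which is the trade-off that the k=1, k=2, and k=3 samples below explore.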
Experimental results indicate that S5-TTS with a lookahead of two words (k=2) strikes a balance between naturalness and intelligibility.
| Reference Audio | S5-TTS (k=1) | S5-TTS (k=2) | S5-TTS (k=3) | T5-TTS | Ground Truth |
|---|---|---|---|---|---|
| Reference Text: And in this last action he falls into the classic sin of Pride. Target Text: Our baggage was immediately carried to the deck of the frigate. I rushed aboard. | | | | | |
| Reference Text: Chris could catch the shine as lantern glow touched the silver buckles on their shoes. Target Text: We shall say no more, but I trust you understand the responsibility you have? | | | | | |
| Reference Text: Alexander leaned forward and warmed his hands before the blaze. Target Text: Alexander walked about the room, looking at everything. | | | | | |
| Reference Text: In the meantime, the uplifted hand of Hawkeye was above him. Target Text: At last the scout spoke in English, and at once explained the embarrassment of their situation. | | | | | |
| Reference Text: The combined bands of both the countries played the music and a fine supper was served. Target Text: The immense blade was so heavy that it took the strength of seven Blueskins to raise it. | | | | | |
| Reference Text: His father was the notorious Sir Jabez Gilchrist, who ruined himself on the turf. Target Text: It was a small pyramid of black, putty like stuff, exactly like the one upon the table of the study. | | | | | |
| Reference Text: The remainder of the sea monsters have disappeared. Target Text: In the distance the clouds resemble great bales of cotton, piled up in picturesque disorder. | | | | | |
| Reference Text: Their walk continued silent for the greater part, neither was quite satisfied with the other. Target Text: Miss Rachel only glanced across her mufflers on his face. | | | | | |
| Reference Text: The drag upon his beak and the light check upon his wings were inexplicable to him, and appalling. Target Text: He saw the frantic fugitive dive over the ledge and disappear. | | | | | |
| Reference Text: He looked up rather ungraciously, but motioned them to be seated. Target Text: He was a round faced, respectable appearing fellow, but his mood was distinctly unsociable. | | | | | |
Experimental results indicate that S5-TTS with a lookahead of two words (k=2) strikes a balance between naturalness and intelligibility.
| Reference Audio | S5-TTS (k=1) | S5-TTS (k=2) | S5-TTS (k=3) | T5-TTS | Ground Truth |
|---|---|---|---|---|---|
| Reference Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky. Target Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. | | | | | |
| Reference Text: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue. Target Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky. | | | | | |
| Reference Text: Health Secretary Frank Dobson made the surprise announcement in the Commons yesterday. Target Text: Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern. | | | | | |
| Reference Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. Target Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows. | | | | | |
| Reference Text: The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain. Target Text: This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue. | | | | | |
| Reference Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. Target Text: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station. | | | | | |
| Reference Text: The important retail business continues to deliver excellent results, Stewart said. Target Text: The company hopes to top seven million passengers in its current financial year. | | | | | |
| Reference Text: The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky. Target Text: These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. | | | | | |
| Reference Text: The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows. Target Text: That capital would be provided equally by the State and established commercial companies. | | | | | |
| Reference Text: Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern. Target Text: When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow. | | | | | |
Experimental results show that knowledge distillation can effectively improve the naturalness of S5-TTS, achieving quality comparable to T5-TTS.
| Reference Audio | S5-TTS (k=2) | S5-TTS-D (k=2) | T5-TTS | Ground Truth |
|---|---|---|---|---|
| Reference Text: In the drawing room, after dinner, their nervousness and awkwardness increased. Target Text: mrs peter broke off in some confusion, and tripped out to meet her husband in the hall. | | | | |
| Reference Text: When he saw her coming, he stopped and wiped his brow. Target Text: Telephone wires hum along the white roads, which always run at right angles. | | | | |
| Reference Text: Hilda was writing in her living room, under the light of a tall desk lamp. Target Text: Marie opened the door and smilingly announced that dinner was served. | | | | |
| Reference Text: The combined bands of both the countries played the music and a fine supper was served. Target Text: The immense blade was so heavy that it took the strength of seven Blueskins to raise it. | | | | |
| Reference Text: Yet the deep marks of conical teeth upon the iron pick are certainly those of the crocodile. Target Text: I take this as my answer, and I leave the Professor to bite his lips with impatience. | | | | |
| Reference Text: Why, if we erect a station at the falls, it is a great economy to get it up to the city. Target Text: A model of this station is preserved in the Deutschen Museum at Munich. | | | | |
| Reference Text: If these articles of consumption are costly, they are felt to be noble and honorific. Target Text: The salient features of this development of domestic service have already been indicated. | | | | |
| Reference Text: Gordon went down the passage to the end of the hall, in the direction Murdoch had indicated. Target Text: At one time, there had been a small dome and a tiny city devoted to the rocket field. | | | | |
| Reference Text: An appeal was made to the executive of the state, but little encouragement was returned. Target Text: Let us observe the contrast between the beginning and the present proportions of the Church. | | | | |
| Reference Text: In the Little Convent there was a centenarian who came from the Abbey of Fontevrault. Target Text: Fontaines in Burgundy is a country that is blest because it gave him birth. | | | | |
Experimental results show that knowledge distillation can effectively improve the naturalness of S5-TTS, achieving quality comparable to T5-TTS.
| Reference Audio | S5-TTS (k=2) | S5-TTS-D (k=2) | T5-TTS | Ground Truth |
|---|---|---|---|---|
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: Focus your attention on your breath and keep your core muscles tight. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: Ensure that the sculpture depicts their essence and captures their spirit, making it easy for anyone who sees it to identify the person. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: Christmas Eve is also a birthday party for our daughter. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: Thanks for the helpful tips on how emergency personnel can protect their hearing. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: This helped me to focus on the positives in my life and to appreciate them more. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: He puts on his backpack and trudges off, wistfully wishing things could be different for him. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: You can attend these events from the comfort of your own home and still experience different cultures. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: They kissed under the stars, and Brooke felt like she was exactly where she was meant to be. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: One of the most recognizable patterns used in Moroccan rugs is the diamond shape, which is also known as a lozenge. | | | | N/A |
| Reference Text: I’m an assistant here to help with questions, provide information, and support you in various tasks. Target Text: Imagine a world where people can teleport objects with their minds. | | | | N/A |