Making a Text-to-speech Blog Reader

Rian Schmidt

April 06, 2024

Table of Contents:

Accessibility for Lazy People
Enter Crappy Text-to-speech
That's all changed.
Making a Reader: Quick and Dirty
But Who's Going to Read It If Not Snoop Dogg?
A Few Gotchas to Consider

Accessibility for Lazy People

Accessibility is a worthy pursuit. It's always a challenge to design things so that anyone who wants to consume the information can. You have to look at text-background contrast ratios, labelling of buttons, even header orders. It's something I strive to do in my work.

Also, you know how humans are so lazy that they can't be bothered to even read? Well, I thought I'd lean into both of those ideas and record my own blog entries like some kind of toe-dip into creating a podcast. That way, people could just hit a button and bask in the warming glow of my passing thoughts without having to exert themselves quite so much.

But, then, you know how humans are so lazy that they can't be bothered to even read their own writing, much less edit out all the "ummms" and lip-smacking and stuff?

Enter Crappy Text-to-speech

Text-to-speech (TTS) has been around quite a long time. Until fairly recently, though, it was painful to listen to. Most versions just swapped out a sound for a written word without any consideration for context. A word sounded the same at the beginning of a sentence or the end. It sounded the same in a question as a statement. And in most cases, it sounded like it was being read by a 1960s science fiction robot.

That's all changed.

TTS technology has advanced incredibly in the last couple of years. Nowadays, you can have a pretty realistic Snoop Dogg read your stuff. It's quite good at intonation and pacing. You can even add markup to your text to give it hints about pauses and pronunciation.
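To make the markup idea concrete, here is a minimal sketch of wrapping paragraphs in SSML (Speech Synthesis Markup Language), the W3C format most TTS services accept in some form. The function names and the 600 ms default pause are illustrative assumptions, and the set of supported tags varies by service and voice engine, so check before relying on any one tag.

```typescript
// Escape characters that are special in XML so plain text is safe inside SSML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wrap each paragraph in <p> and put an explicit pause between paragraphs.
// pauseMs is an assumed default, not a recommended value.
function toSsml(paragraphs: string[], pauseMs: number = 600): string {
  const body = paragraphs
    .map((p) => `<p>${escapeXml(p.trim())}</p>`)
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak>${body}</speak>`;
}

console.log(toSsml(["Accessibility is a worthy pursuit.", "Also, humans are lazy."]));
```

Pronunciation hints work the same way: standard SSML has a sub tag with an alias attribute that tells the voice to say one thing where the text shows another.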

Some services allow you to clone your own voice (and avatar) so that you can read your material yourself, but... not.

Making a Reader: Quick and Dirty

For my project, I did a quick scan of what was out there and decided to just go with AWS Polly. It's cheap, and I've already got the AWS account all set up, so I just needed to give myself some additional access in IAM to allow it.

I won't dump the sample code here, but the basic idea is to check whether a recording exists in my cache and, if it doesn't, submit the job to Polly to create it. Then, when it's complete, put it in the cache and play it for the user.

The user clicks the button, and the request is submitted to a Remix action function on the server. That function checks S3 for a recording tagged with the article slug. If found, it just returns the full filename back to the browser to play.

If not, the job is submitted to Polly via the AWS SDK, and the user is informed that it's encoding and to check back later. It only takes a minute to do the reading, but I didn't want to manage an async job and the associated complexity of tracking it. Also, this only happens once per article due to the caching of the recordings.
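The flow above can be sketched roughly as follows. This is not the site's actual code: the RecordingStore and SpeechSynthesizer interfaces are hypothetical stand-ins for the S3 and Polly calls, and the audio/<slug>/ key layout is an assumption. With the AWS SDK for JavaScript v3, listKeys might wrap ListObjectsV2Command and startTask might wrap StartSpeechSynthesisTaskCommand.

```typescript
// Hypothetical stand-ins for the real S3 and Polly calls, so the flow runs without AWS.
interface RecordingStore {
  // Returns the keys of stored objects whose key starts with `prefix`.
  listKeys(prefix: string): Promise<string[]>;
}

interface SpeechSynthesizer {
  // Starts a synthesis job that will eventually write its output under `outputPrefix`.
  startTask(text: string, outputPrefix: string): Promise<void>;
}

type ReadAloudResult =
  | { status: "ready"; url: string }
  | { status: "encoding"; message: string };

// Assumed naming scheme: every recording for an article lives under audio/<slug>/.
function recordingPrefix(slug: string): string {
  return `audio/${slug}/`;
}

// Pure decision step: return the first mp3 key in the listing, if there is one.
function findRecording(keys: string[]): string | undefined {
  return keys.find((key) => key.endsWith(".mp3"));
}

// What a server-side action could do when the button is clicked.
async function readAloud(
  slug: string,
  articleText: string,
  store: RecordingStore,
  synth: SpeechSynthesizer,
  cdnBaseUrl: string,
): Promise<ReadAloudResult> {
  const prefix = recordingPrefix(slug);
  const cached = findRecording(await store.listKeys(prefix));
  if (cached !== undefined) {
    // Cache hit: hand the browser a URL it can play right away.
    return { status: "ready", url: `${cdnBaseUrl}/${cached}` };
  }
  // Cache miss: start the job and ask the user to come back later.
  await synth.startTask(articleText, prefix);
  return { status: "encoding", message: "Encoding the audio. Check back in a minute." };
}
```

One thing this sketch does not handle: a second click while the first job is still running would start a duplicate job. For a low-traffic blog that is probably an acceptable trade for not tracking job state.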

But Who's Going to Read It If Not Snoop Dogg?

It turns out that Polly has a few (47 at last count) options for voices, plus a choice of engines to generate those voices: standard, neural, and long-form. Long story short, so to speak, I went with the long-form engine because, as AWS says:

Amazon Polly has a long-form engine that produces human-like, highly expressive, and emotionally adept voices. Long-form voices are designed to captivate listeners’ attention for longer content, such as news articles, training materials, or marketing videos.

That pretty much made the decision for me. There were only three voices with that engine, and I liked the prosody of the dude best.

You can check out the results above, with the "Read This to Me" button. I think it sounds amazing. His name's Gregory. Good job, Gregory.
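For reference, the voice and engine choice boils down to a couple of fields on the synthesis request. The sketch below builds such a request as a plain object. The field names follow Polly's StartSpeechSynthesisTask input, but the bucket name, the audio/<slug>/ prefix, and the buildTaskInput helper itself are illustrative assumptions, not the configuration this site uses.

```typescript
// A subset of the fields Polly's StartSpeechSynthesisTask accepts.
interface SynthesisTaskInput {
  Engine: "standard" | "neural" | "long-form";
  VoiceId: string;
  OutputFormat: "mp3" | "ogg_vorbis" | "pcm";
  Text: string;
  TextType: "text" | "ssml";
  OutputS3BucketName: string;
  OutputS3KeyPrefix: string;
}

// Hypothetical helper: the bucket and key prefix are placeholders for illustration.
function buildTaskInput(slug: string, text: string, bucket: string): SynthesisTaskInput {
  return {
    Engine: "long-form",   // the engine chosen in this article
    VoiceId: "Gregory",    // the voice chosen in this article
    OutputFormat: "mp3",
    Text: text,
    TextType: "text",
    OutputS3BucketName: bucket,
    OutputS3KeyPrefix: `audio/${slug}/`, // assumed layout: one folder per article
  };
}

console.log(buildTaskInput("tts-blog-reader", "Hello from Gregory.", "example-audio-bucket"));
```

With the AWS SDK for JavaScript v3, an object like this would be passed to StartSpeechSynthesisTaskCommand from @aws-sdk/client-polly.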

A Few Gotchas to Consider

I'm pretty happy with the result, but here are a few things to consider when whipping up such a thing.

  • Caching: You probably don't want to regenerate the recording every time someone asks for it. Why? It doesn't change, so cache it somewhere and only generate it once. On the other hand, you might want to re-generate it, so consider how to manage that.
  • Distribution: I ended up going with an S3 bucket behind a CloudFront CDN. Now, I don't get much traffic. I only added CloudFront because, for reasons unclear, S3 would not serve the audio file. The request would go into a pending state and just time out. No such problem with CloudFront.
  • Cost: There are some pricey options and some cheap options. The pricey ones seem primarily to offer more customization in terms of having Laurence Fishburne or whoever read your stuff. Ultimately, my goal was just a clear, natural-sounding voice, and Polly fit the bill with 500K characters per year in the free tier. Now, if I get over that, it's $100 per million characters, which isn't nothing, so I'll need to keep an eye on it.
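The cost arithmetic is easy to sanity-check. Here is a small sketch using the figures quoted above (500K free characters, then $100 per million). The function name is made up for illustration, and it treats the free allowance as one lump for whatever period you are totaling, so check the current Polly pricing page for the real terms.

```typescript
// Figures quoted in this article; verify against current Polly pricing before relying on them.
const FREE_CHARACTERS = 500_000;
const USD_PER_MILLION_CHARACTERS = 100;

// Estimate the bill for a total number of synthesized characters.
// Thanks to caching, each article should only count once.
function estimateCostUsd(totalCharacters: number, freeCharacters: number = FREE_CHARACTERS): number {
  const billable = Math.max(0, totalCharacters - freeCharacters);
  return (billable / 1_000_000) * USD_PER_MILLION_CHARACTERS;
}

// Example: 150 articles at roughly 5,000 characters each is 750,000 characters.
console.log(estimateCostUsd(150 * 5_000));
```

So a blog would need to publish well over a hundred average-length posts before the meter starts running, provided the recordings are cached and not regenerated.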