Hosting a text-to-speech service on Heroku

Using marytts-http, you can easily host a multilingual open source text-to-speech service on Heroku and restrict requests with HMAC-SHA256.

In college, I made a vocab memorizer and used MaryTTS for speech synthesis to learn the correct German pronunciations.

MaryTTS is an open-source multilingual text-to-speech library written in Java, but it doesn't expose a convenient, easy-to-deploy web service. So I recently built marytts-http, which wraps it as a Java servlet using Jetty.

You can deploy it to Heroku like this (assuming you have the Heroku Toolbelt installed and are logged in):

git clone https://github.com/draffensperger/marytts-http  
cd marytts-http  
heroku create  
git push heroku master  

Heroku will perform a Maven build for MaryTTS with English and German voices. You can then try the service by specifying text and locale and visiting e.g. [your-marytts-app].herokuapp.com/?text=Hallo&locale=de
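
If you would rather build that URL in code than by hand, a minimal Ruby sketch might look like this. The text and locale parameters come from the example above; the helper name marytts_request_url and the https:// scheme are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical helper: builds an unsigned marytts-http request URL.
# base_url must include the scheme, e.g. "https://your-marytts-app.herokuapp.com".
def marytts_request_url(base_url, text, locale)
  uri = URI.parse(base_url)
  uri.path = '/'
  # encode_www_form escapes the values, so spaces and umlauts are safe in the query
  uri.query = URI.encode_www_form(text: text, locale: locale)
  uri.to_s
end

puts marytts_request_url('https://your-marytts-app.herokuapp.com', 'Hallo Welt', 'de')
```

This only works while the service is unsecured; once HMAC_SECRET is set, requests also need a signature.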

Securing requests

You can secure your service by setting the HMAC_SECRET environment variable to a Base64-encoded key. To generate a random 32-byte key and set it on Heroku, run:

heroku config:set HMAC_SECRET=`cat /dev/urandom | head -c 32 | base64`  
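
If /dev/urandom is not available on your machine, a key of the same shape can be generated from Ruby instead. This is a sketch using only the standard library; the variable name is illustrative.

```ruby
require 'securerandom'

# 32 bytes from the OS's secure random source, Base64 encoded on a single line
hmac_secret = SecureRandom.base64(32)

# Paste the printed value into: heroku config:set HMAC_SECRET=<value>
puts hmac_secret
```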

Requests must then be signed using HMAC-SHA256. You can also pass an expires parameter so that a signed request is only valid for a limited period of time. The marytts-http readme has the details, and a Rails code example follows below.
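
To make the idea concrete, here is a plain-Ruby sketch of signing and checking a request. It assumes, as the Rails helper later in this post does, that the signature is the Base64-encoded HMAC-SHA256 of the parameter values concatenated in a fixed order. The verify function only illustrates what a server-side check could look like; it is not taken from marytts-http, and the readme remains the authority on the exact format. OpenSSL.secure_compare needs a reasonably recent Ruby (3.0 or later).

```ruby
require 'openssl'

# Assumed signing order, mirroring the Rails helper in this post.
SIGNED_PARAMS = %i[text locale gender voice style effects expires].freeze

# Base64-encoded HMAC-SHA256 over the concatenated parameter values.
def sign(params, key)
  data = SIGNED_PARAMS.map { |name| params[name].to_s }.join
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha256'), key, data)
  [digest].pack('m0')
end

# Illustrative check: reject expired requests, then compare signatures
# in constant time so the comparison does not leak timing information.
def verify(params, signature, key, now = Time.now.to_i)
  return false if params[:expires] && params[:expires].to_i < now
  OpenSSL.secure_compare(sign(params, key), signature)
end

key = 'demo-key-not-for-production' # real keys come from HMAC_SECRET, Base64 decoded
params = { text: 'Hallo', locale: 'de', expires: Time.now.to_i + 300 }
signature = sign(params, key)

puts verify(params, signature, key)                          # true
puts verify(params.merge(text: 'Guten Tag'), signature, key) # false: text was changed
```

Because expires is part of the signed data, a client cannot extend the lifetime of a URL without invalidating its signature.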

Using it in a Rails app

One way to use this in a Rails app would be via the <audio> tag. Assuming you have a view helper, marytts_url, this snippet would cause your page to say "Hallo" in German using your text-to-speech service:

<%= audio_tag(marytts_url('Hallo', locale: :de), autoplay: true) %>  

The view helper that constructs and signs the URL for the service could look like this:

module MaryttsHelper  
  @@marytts_key = Base64.decode64(Rails.application.secrets.marytts_key)
  @@marytts_host = Rails.application.secrets.marytts_host

  def marytts_url(text, opts={})
    # Expiry time is represented as a unix timestamp
    opts[:expires] = opts[:expires].to_i if opts[:expires].present?

    params_to_sign_in_order = [:text, :locale, :gender, :voice, :style, 
                               :effects, :expires]
    params = opts.merge(text: text).slice(*params_to_sign_in_order)
    sign_url(@@marytts_key, @@marytts_host, params)
  end

  private

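  # Signs the parameter values, concatenated in order, and appends the
  # signature to the query string. base_url must include the scheme.
  def sign_url(key, base_url, params)
    param_values = params.values.map(&:to_s).join
    signature = hmac_sha256(param_values, key)
    URI.join(base_url, '/?' + params.merge(signature: signature).to_query).to_s
  end

  # Base64-encoded HMAC-SHA256 of data under key.
  def hmac_sha256(data, key)
    digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new('sha256'), key, data)
    Base64.strict_encode64(digest)
  end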
end  

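You would set marytts_host and marytts_key in secrets.yml to the URL of your service and the Base64-encoded HMAC_SECRET value from above, respectively. The URL needs to include the scheme, e.g. https://[your-marytts-app].herokuapp.com, because URI.join requires an absolute base URL.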

Here's a PHP code example of how to embed a marytts-http link as well.

What about French, Italian, Swedish, Russian, Turkish, and Telugu?

MaryTTS supports those languages too! By modifying the marytts-http Maven build script, we could add language and voice packages for them. A list of the voices is in the (non-user-friendly) MaryTTS components.xml.

How does it actually sound?

The German "bits3-hsmm" and English "cmu-slt-hsmm" voices included in marytts-http are space-efficient (under 3MB total) but sound a bit tinny:

"Welcome to the world of speech synthesis!"

"Willkommen in der Welt der Sprachsynthese!"

If we used more space for the German "dfki-pavoque-neutral" (425MB) and the British English "dfki-spike" (129MB) voices, the quality would be better:

"This voice is higher quality."

"Diese Stimme ist qualitätvoller."

But given the Heroku slug size limit of 300MB, and the increased RAM needed for those voices, deploying them may take more work and larger dyno sizes.
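
The size arithmetic behind that, as a quick sketch. It uses only the figures quoted in this post and ignores the rest of the slug (the JVM, the app itself, and the small voices), so the real headroom would be smaller still.

```ruby
SLUG_LIMIT_MB = 300 # limit quoted in this post
VOICE_SIZES_MB = { 'dfki-pavoque-neutral' => 425, 'dfki-spike' => 129 }.freeze

VOICE_SIZES_MB.each do |voice, size|
  verdict = size <= SLUG_LIMIT_MB ? 'is under the limit on its own' : 'exceeds the limit on its own'
  puts "#{voice}: #{size}MB #{verdict}"
end

total = VOICE_SIZES_MB.values.sum
puts "both voices: #{total}MB, #{total - SLUG_LIMIT_MB}MB over the limit"
```

So the British English voice alone might fit, while the German one cannot be shipped inside the slug at all.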

To try out the various voices and languages, you can download MaryTTS, run its component installer to get the voices, and start a local MaryTTS server, which provides a web interface for interacting with the system.
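
Besides the web interface, the local server can be queried over HTTP. The sketch below assumes the defaults as I remember them from MaryTTS 5.x (port 59125 and a /process endpoint taking INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, LOCALE and VOICE parameters); please check them against your installation. The helper names local_mary_uri and save_speech are made up for this example.

```ruby
require 'uri'
require 'net/http'

# Assumed default address of a locally running MaryTTS server; adjust if yours differs.
MARY_SERVER = 'http://localhost:59125'

# Builds the request URI for synthesizing text as WAV audio.
def local_mary_uri(text, locale: 'en_US', voice: nil)
  query = {
    'INPUT_TEXT' => text,
    'INPUT_TYPE' => 'TEXT',
    'OUTPUT_TYPE' => 'AUDIO',
    'AUDIO' => 'WAVE_FILE',
    'LOCALE' => locale
  }
  query['VOICE'] = voice if voice
  uri = URI.join(MARY_SERVER, '/process')
  uri.query = URI.encode_www_form(query)
  uri
end

# Fetches the synthesized audio and writes it to a file.
def save_speech(text, path, **opts)
  response = Net::HTTP.get_response(local_mary_uri(text, **opts))
  raise "MaryTTS returned HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  File.binwrite(path, response.body)
end

# With the server running and the voice installed:
# save_speech('Willkommen in der Welt der Sprachsynthese!', 'welcome.wav',
#             locale: 'de', voice: 'bits3-hsmm')
```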

Other text-to-speech options

Several hosted text-to-speech web services are also available, including iSpeech, the AT&T Speech API, Ivona (Amazon), and IBM. Depending on your app's needs, a managed API service may be best for you.