Online (Stream) recognition API
To provide speech recognition in situations where a fast response is necessary, an online recognition API is available. This API is based on the WebSocket protocol.
To see your available systems, make a GET request to https://services.tilde.com/service/asr/systems?systemType=Online and add the Authorization header (or the x-api-key header if you use an API key) to your request.
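For illustration, a minimal sketch of such a request in Python using the requests library might look like this (the x-api-key header is used here as an assumption based on the description above; use the Authorization header instead if you authenticate with a token):

import requests

# List the online ASR systems available to your account.
# Replace <API-KEY> with your key.
response = requests.get(
    "https://services.tilde.com/service/asr/systems",
    params={"systemType": "Online"},
    headers={"x-api-key": "<API-KEY>"},
    timeout=10,
)
response.raise_for_status()
print(response.json())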
Opening a session
To open a session, connect to the specified server websocket address:
wss://services.tilde.com/service/asr/ws/<SYSTEM>?x-api-key=<API-KEY>
where:
<SYSTEM>: ASR system string identifier.
<API-KEY>: Authentication key.
Optional query params:
- channelCount: Audio channel count. Default value: 1.
- contentType: Optional. Audio content type. Default value: audio/x-raw.
- sampleRate: Optional. Audio sample rate. Default value: 16000.
- headerlessAudio: Set to true if no content type is passed and the ASR worker should determine it from the stream. In this case contentType, sampleRate and channelCount are not passed to the worker.
- noResponseTimeoutMS: If no result is produced for the given number of milliseconds, the API closes the connection (note that this is not the recommended way to end a session; the client should send "EOS" to stop the stream).
After the session is opened, the client application shall send an intro message (an empty JSON object if no advanced options are specified):
{
}
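A minimal sketch of opening a session and sending the intro message, using the Python websocket-client package (the query parameter values shown are illustrative):

import json
import websocket

# Connect to the ASR websocket; replace <SYSTEM> and <API-KEY> with real values.
url = ("wss://services.tilde.com/service/asr/ws/<SYSTEM>"
       "?x-api-key=<API-KEY>&sampleRate=16000&contentType=audio/x-raw")
ws = websocket.create_connection(url)

# Intro message: an empty JSON object when no advanced options are used.
ws.send(json.dumps({}))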
Sending audio
Speech should be sent to the server in raw blocks of binary data, using the encoding specified when the session was opened. It is recommended that a new block is sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.
If there is no more speech in the audio, a special 3-byte ANSI-encoded string "EOS" ("end of stream") can be sent to tell the server that no more speech is coming and that the recognition hypothesis can be finalized.
After sending "EOS", client has to keep the websocket open to receive recognition results from the server. If speech is detected after "EOS" sent, the client can continue to send audio buffers via the same websocket.
When the recognition result has been finalized and sent to the client, the server sends a JSON message which looks like this:
{
"adaptation_state": {
"type": "<data encoding, e.g. string+gzip+base64>",
"value": "<adaptation data, e.g. iVector>"
}
}
Reading results
The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:
{
"status": "response status (integer), see codes below",
"result": {
"hypotheses": [
{
"utterance": "<transcription>",
"confidence": "<(optional) confidence of the hypothesis (float, 0..1)>"
}
],
"final": "<true when the hypothesis is final, i.e., doesn't change any more>"
}
}
The following status codes are currently in use:
- 0 – Success. Usually used when recognition results are sent.
- 1 – No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
- 2 – Aborted. Recognition was aborted for some reason.
- 9 – Not available. Max load limit reached.
- 10 – Authentication failed.
- 11 – All recognition workers are currently in use and real-time recognition is not possible.
The websocket is always closed by the server after sending a non-zero status update (except for status code 11).
Examples of server responses:
{"status": 9}
{"status": 0, "result": {"hypotheses": [{"transcript": "see on"}], "final": false}}
{"status": 0, "result": {"hypotheses": [{"transcript": "see on teine lause."}], "final": true}}
The server segments incoming audio on the fly. For each segment, a series of non-final hypotheses, followed by one final hypothesis, is sent. Non-final hypotheses are used to present partial recognition hypotheses to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. After sending a final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed. The client is responsible for presenting the results to the user in a way suitable for the application.
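A minimal sketch of a result-reading loop, continuing the Python example above (field names follow the response format described in this section):

import json
import websocket

final_segments = []
while True:
    try:
        message = ws.recv()
    except websocket.WebSocketConnectionClosedException:
        break
    if not message:
        break
    data = json.loads(message)
    if "adaptation_state" in data:       # end-of-session message, see the next section
        break
    if data.get("status") != 0:
        print("Recognition stopped with status", data.get("status"))
        break
    hypothesis = data["result"]["hypotheses"][0]
    if data["result"]["final"]:
        final_segments.append(hypothesis["transcript"])
    else:
        print("partial:", hypothesis["transcript"])

print(" ".join(final_segments))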
Ending the session
To end the recognition session, the client sends the special 3-byte ANSI-encoded string "EOS". The client has to keep the websocket open to receive the remaining recognition results from the server.
When recognition is finalized and all remaining results have been sent to the client, the server sends a JSON message with speaker adaptation data:
{
"adaptation_state": {
"type": "<data encoding, e.g. string+gzip+base64>",
"value": "<adaptation data, e.g. iVector>"
}
}
The client can now close the websocket.
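A minimal sketch of ending the session, continuing the Python example above: after "EOS" has been sent, the client drains the remaining messages, keeps the adaptation data, and closes the socket.

import json

adaptation_state = None
while adaptation_state is None:
    message = ws.recv()
    if not message:                 # server closed the connection early
        break
    data = json.loads(message)
    adaptation_state = data.get("adaptation_state")   # speaker adaptation data

ws.close()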
Advanced features
The intro message (see section Opening a session) can include instructions for enabling and configuring various advanced features.
Setting post-processors
Post-processing of speech recognition results can be configured by providing the following options:
{
"enable-partial-postprocess": "<postprocessors>",
"enable-final-postprocess": "<postprocessors>",
"enable-postprocess": "<postprocessors>",
...
}
where <postprocessors> can be:
- A list of postprocessors to enable, e.g. ["numbers_all", "example2", ...]
- A dictionary representing postprocessors and processing options, e.g.
{
"numbers_all": "escape",
"numbers": "auto",
"commands2": { "commands": ["_$_ESCAPE_$_", "_$_NEW_LINE_$_", "_$_DELETE_LEFT_$_"] }
}
"enable-partial-postprocess" applies postprocessing to partial results from ASR and can significantly increase the latency. "enable-full-postprocess" applies postprocessing to full results only. "enable-postprocess" is a shorthand for applying postprocessor to both final and partial results. Specifically, for partial results, ASR uses a combination (union of 2 sets) of the "enable-partial-postprocess" and "enable-postprocess" settings, while for final results, it uses a combination of the "enable-full-postprocess" and "enable-postprocess" settings.
The list of available postprocessors is specific to each ASR system. Below is the list of standard postprocessors:
- "numbers" - rewrites most numbers with digits, e.g. eleven -> 11, but one -> one
- "numbers_all" - rewrites ALL numbers with digits, e.g. eleven -> 11, one -> 1
- "punct" - performs capitalization and punctuation of recognized text
Vocabulary expansion
When submitting a recognition request, the client can optionally provide the "expand-vocab" parameter, which contains a list of words to be added to the vocabulary.
{
...
"expand-vocab": "<vocabulary>"
}
<vocabulary> is a JSON array containing the following entry for each word:
{ "spoken_form": "<spoken_form>", "word": "<word>", "weight": "<weight>" }
where
<spoken_form>: the word written as it is pronounced.
<word>: optional. If the word is written differently from its "pronunciation", the spoken form is replaced with this written form.
<weight>: optional. The default value is 1. Increasing the weight increases the probability of the word being recognized, but it can degrade overall recognition accuracy.
Example:
[
{ "spoken_form": "šeir", "word": "share", "weight": "2" },
{ "spoken_form": "Askars" }
]
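A sketch of an intro message with vocabulary expansion, using the example entries above and continuing the Python example:

import json

intro = {
    "expand-vocab": [
        {"spoken_form": "šeir", "word": "share", "weight": "2"},
        {"spoken_form": "Askars"},
    ]
}
ws.send(json.dumps(intro))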