Conversation
@PeganovAnton could you please review this? Need this merged ASAP to enable QA to test S2S.
riva/client/nmt.py
Outdated
Generates speech recognition responses for fragments of speech audio in :param:`audio_chunks`.
The purpose of the method is to perform speech recognition "online" - as soon as
audio is acquired on small chunks of audio.

All available audio chunks will be sent to a server on first ``next()`` call.

Args:
    audio_chunks (:obj:`Iterable[bytes]`): an iterable object which contains raw audio fragments
        of speech. For example, such raw audio can be obtained with

        .. code-block:: python

            import wave
            with wave.open(file_name, 'rb') as wav_f:
                raw_audio = wav_f.readframes(n_frames)

    streaming_config (:obj:`riva.client.proto.riva_asr_pb2.StreamingRecognitionConfig`): a config for streaming.
        You may find description of config fields in message ``StreamingRecognitionConfig`` in
        `common repo
        <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#riva-proto-riva-asr-proto>`_.
        An example of creation of streaming config:

        .. code-block:: python

            from riva.client import RecognitionConfig, StreamingRecognitionConfig
            config = RecognitionConfig(enable_automatic_punctuation=True)
            streaming_config = StreamingRecognitionConfig(config, interim_results=True)

Yields:
    :obj:`riva.client.proto.riva_asr_pb2.StreamingRecognizeResponse`: responses for audio chunks in
    :param:`audio_chunks`. You may find description of response fields in declaration of
    ``StreamingRecognizeResponse`` message `here
    <https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#riva-proto-riva-asr-proto>`_.
The docstring needs to be updated.
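For reference, a minimal sketch of what the updated example in the S2S docstring could show (the asr_config field name and the overall shape of StreamingTranslateSpeechToSpeechConfig are assumptions based on how s2s_config is built later in this script):

    import riva.client

    # Sketch only: assumes the S2S streaming config wraps the ASR streaming config
    # together with a TranslationConfig, as in the config constructed in this PR.
    asr_config = riva.client.StreamingRecognitionConfig(
        config=riva.client.RecognitionConfig(enable_automatic_punctuation=True),
        interim_results=True,
    )
    s2s_config = riva.client.StreamingTranslateSpeechToSpeechConfig(
        asr_config=asr_config,
        translation_config=riva.client.TranslationConfig(target_language_code="en-US"),
    )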
nchannels = 1
if args.list_input_devices:
    riva.client.audio_io.list_input_devices()
    return
Suggested change:

    return
if args.list_output_devices:
    riva.client.audio_io.list_output_devices()
    return
sound_stream = riva.client.audio_io.SoundCallBack(
    args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=44100
)
print(sound_stream)
Why do we need this print?
if args.output_device is not None or args.play_audio:
    print("playing audio")
    sound_stream = riva.client.audio_io.SoundCallBack(
        args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=44100
Maybe we should make framerate a parameter of the script, like --sample-rate-hz in tts/talk.py?
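For example, a sketch along these lines (flag name copied from tts/talk.py; the default of 44100 matches the hard-coded value above):

    parser.add_argument(
        "--sample-rate-hz",
        type=int,
        default=44100,
        help="Sample rate in Hz for the output audio device.",
    )

    # ...later, instead of the hard-coded 44100:
    sound_stream = riva.client.audio_io.SoundCallBack(
        args.output_device, nchannels=nchannels, sampwidth=sampwidth, framerate=args.sample_rate_hz
    )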
sampwidth = 2
nchannels = 1
sampwidth and nchannels are set in two places: here and in the play_responses() function. Could you make them global variables?
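i.e. something like (a sketch; the values are the ones already used in the diff):

    # Module-level constants shared by main() and play_responses().
    SAMPWIDTH = 2   # bytes per sample (16-bit PCM)
    NCHANNELS = 1   # mono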
| "then the default output audio device will be used.", | ||
| ) | ||
|
|
||
| parser = add_asr_config_argparse_parameters(parser, profanity_filter=True) |
You'll probably need to set max_alternatives=False and word_time_offsets=False because these parameters are pointless for this script. Do you think we also need to add a speaker_diarization=False flag?
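Roughly (assuming the helper accepts these keyword flags in the same way the other ASR scripts use them):

    parser = add_asr_config_argparse_parameters(
        parser, profanity_filter=True, max_alternatives=False, word_time_offsets=False
    )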
| parser.add_argument("--output-device", type=int, help="Output device to use.") | ||
| parser.add_argument("--target-language-code", default="en-US", help="Language code of the output language.") | ||
| parser.add_argument( | ||
| "--play-audio", |
If --play-audio is not set, the script doesn't produce any output. We should probably add an --output parameter, as in tts/talk.py, so that the script can still produce some output when run on a server.
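A rough sketch of what that could look like (the --output flag mirrors tts/talk.py; writing the received audio with the standard wave module is an assumption):

    import wave
    from pathlib import Path

    parser.add_argument("-o", "--output", type=Path, help="Output .wav file to write translated speech to.")

    # ...in main(), open the file once and pass it to play_responses() alongside sound_stream:
    out_f = None
    if args.output is not None:
        out_f = wave.open(str(args.output), 'wb')
        out_f.setnchannels(nchannels)
        out_f.setsampwidth(sampwidth)
        out_f.setframerate(44100)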
play_responses(responses=nmt_service.streaming_s2s_response_generator(
    audio_chunks=audio_chunk_iterator,
    streaming_config=s2s_config), sound_stream=sound_stream)
Suggested change:

play_responses(
    responses=nmt_service.streaming_s2s_response_generator(
        audio_chunks=audio_chunk_iterator,
        streaming_config=s2s_config,
    ),
    sound_stream=sound_stream
)
        interim_results=True,
    ),
    translation_config = riva.client.TranslationConfig(
        target_language_code=args.target_language_code,
There should also be source_language_code and, probably, model_name here, as in the config message.
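i.e. roughly (assuming --source-language-code and --model-name arguments are added to the parser, and that TranslationConfig exposes these fields as in the proto):

    translation_config=riva.client.TranslationConfig(
        source_language_code=args.source_language_code,
        target_language_code=args.target_language_code,
        model_name=args.model_name,
    ),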
first = True  # first tts output chunk received
auth = riva.client.Auth(args.ssl_cert, args.use_ssl, args.server)
nmt_service = riva.client.NeuralMachineTranslationClient(auth)
s2s_config = riva.client.StreamingTranslateSpeechToSpeechConfig(
Do we need a tts_config as in the proto? If so, we could add an add_tts_config_argparse_parameters() function to argparse_utils.py and refactor tts/talk.py to use it.
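A rough sketch of what such a helper might look like (flag names borrowed from tts/talk.py; treat the exact set of options as an assumption):

    import argparse

    def add_tts_config_argparse_parameters(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # Hypothetical helper for argparse_utils.py: collects the TTS options shared
        # by tts/talk.py and this speech-to-speech script.
        parser.add_argument("--voice", help="A voice name to use for synthesis.")
        parser.add_argument("--language-code", default="en-US", help="Language code of the synthesized speech.")
        parser.add_argument("--sample-rate-hz", type=int, default=44100, help="Sample rate of synthesized audio in Hz.")
        return parser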
Force-pushed from ba394ef to d2213b6
Force-pushed from d2213b6 to db64efc
Force-pushed from db64efc to b665b2f
Adding basic speech-to-speech CLI