I have an audio file in m4a format that is a recording of a conference call, and a transcript file in XML format that contains the text of the same call. How can I take these two files as input and perform forced alignment? I want to get the time interval (start and end time) of each sentence in the transcript.
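
To make the setup concrete, here is a minimal sketch of the input preparation this kind of task usually needs before any aligner is run. It is not a full solution. The XML layout in the sample, the file names `call.m4a` and `call.wav`, and the helper names `transcript_sentences` and `ffmpeg_to_wav_command` are all assumptions for illustration; the real transcript schema is not known here, so the code simply collects every text node. The sentence splitter is deliberately naive and will mis-split abbreviations such as "Inc." or "Q3.".

```python
import re
import xml.etree.ElementTree as ET


def transcript_sentences(xml_text):
    """Return the transcript's sentences, ignoring the XML tag structure."""
    root = ET.fromstring(xml_text)
    # itertext() visits the text of every element, so no specific schema is assumed.
    text = " ".join(chunk.strip() for chunk in root.itertext() if chunk.strip())
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def ffmpeg_to_wav_command(m4a_path, wav_path, sample_rate=16000):
    """Build (but do not run) an ffmpeg command that converts m4a to mono WAV."""
    return ["ffmpeg", "-y", "-i", m4a_path,
            "-ac", "1", "-ar", str(sample_rate), wav_path]


if __name__ == "__main__":
    # Hypothetical transcript; the real file's tags and attributes may differ.
    sample = (
        "<call>"
        "<turn speaker='A'>Good morning. Welcome to the call.</turn>"
        "<turn speaker='B'>Thank you!</turn>"
        "</call>"
    )
    for number, sentence in enumerate(transcript_sentences(sample), 1):
        print(number, sentence)
    print(" ".join(ffmpeg_to_wav_command("call.m4a", "call.wav")))
```

The resulting list, written one sentence per line to a text file, together with the converted WAV file, is the kind of input that forced-alignment tools such as aeneas or the Montreal Forced Aligner expect; the aligner itself then produces the start and end time for each sentence.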