Subtitle Edit and Whisper

9th Jan 2025 15:01 #1
koberulz

Member

Join Date
Oct 2006

Location
Australia
I tried FasterWhisper XXL, but it didn't work very well. CPP works much better (it's a concert film and FWXXL skips over most of the singing but CPP is accurately transcribing lyrics), but it's taking about 6x the video length. My GPU, meanwhile, is sitting on 5%, according to Task Manager's Performance tab, which indicates that it's not doing much of anything.

So is there a setting I'm missing to make it use the GPU? Also, is there a setting to allow it to create SDH subs? My understanding is that Whisper is capable of doing this, but there doesn't seem to be anything within Subtitle Edit for that.

I'm running Win10 with an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz and NVIDIA GeForce GTX 1060 6GB.

9th Jan 2025 16:39 #2
VoodooFX

Video Damager

Join Date
Oct 2021

Location
At Doom9
Originally Posted by koberulz

I tried FasterWhisper XXL, but it didn't work very well.

Because you didn't use the proper settings for your audio. For concerts you want to use a better VAD:

Code:

--vad_method pyannote_v3

and background noise removal:

Code:

--ff_mdx_kim2

I'm not sure that 6GB will be enough for mdx_kim2 by default; report how it goes.
InpaintDelogo - advanced logo removal & hardcoded subtitles extraction
Standalone Faster-Whisper - Portable AI auto-transcription-translation
9th Jan 2025 17:49 #3
koberulz

Originally Posted by VoodooFX

Originally Posted by koberulz

I tried FasterWhisper XXL, but it didn't work very well.

Because you didn't use the proper settings for your audio. For concerts you want to use a better VAD:

Code:

--vad_method pyannote_v3

and background noise removal:

Code:

--ff_mdx_kim2

I'm not sure that 6GB will be enough for mdx_kim2 by default; report how it goes.

I have no idea how I'm supposed to even set these? Or what they mean? I just pulled up the dialog in SE and clicked "generate".
9th Jan 2025 17:56 #4
VoodooFX

These are command-line parameters. In the same window as the "Generate" button there is an "Advanced" button; you add the parameters there.

9th Jan 2025 19:49 #5
koberulz

What's VAD? What do these actually do?

I'm more confused on the SDH thing, because I canceled the CPP process that was running and it actually had SDH in it, which the previous CPP pass (which froze) didn't.

9th Jan 2025 20:20 #6
VoodooFX

Originally Posted by koberulz

What's VAD?

It's voice activity detection: audio detected as voice is passed to Whisper for transcription, and areas where no voice is detected are skipped.

Originally Posted by koberulz

I'm more confused on the SDH thing...

The original Whisper models are not meant to do any "SDH" thing; you are observing so-called hallucinations.
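
To make the VAD idea above concrete, here is a minimal sketch in Python. It is only an illustration: a simple loudness threshold stands in for a real detector, and models such as pyannote_v3 work very differently. The function name, frame size and threshold are all made up for this example. The point is just that only the detected ranges are handed on for transcription, so anything the detector misses (for example singing it does not recognise as voice) never reaches Whisper.

```python
from typing import List, Optional, Tuple


def detect_voice_segments(
    samples: List[float],
    frame_size: int = 4,
    threshold: float = 0.1,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose mean absolute amplitude exceeds threshold."""
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i  # a "voiced" stretch begins here
        elif start is not None:
            segments.append((start, i))  # the stretch ended at this frame
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


# Toy signal: silence, a loud burst, silence, a quieter burst.
signal = [0.0] * 8 + [0.5, -0.6, 0.4, -0.5] * 2 + [0.0] * 4 + [0.3, -0.3, 0.3, -0.3]
segments = detect_voice_segments(signal)
# Only these ranges would be passed to the transcriber; everything else is skipped.
print(segments)
```

Raising the threshold in this toy version makes it drop the quieter burst, which is loosely analogous to a VAD that is too strict for the material.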

9th Jan 2025 20:23 #7
koberulz

Originally Posted by VoodooFX

The original Whisper models are not meant to do any "SDH" thing; you are observing so-called hallucinations.

Ah, yes, it's mere coincidence that it keeps hallucinating the subtitle "[APPLAUSE]" every time the crowd applauds, and then "[MUSIC]" when the band starts playing.

Looks like FasterWhisper with your suggested settings is going to take about 24 hours to run, which is a bit more than the 10 hours it was going to take CPP.

9th Jan 2025 20:32 #8
VoodooFX

Originally Posted by koberulz

Ah, yes, it's mere coincidence that it keeps hallucinating the subtitle "[APPLAUSE]" every time the crowd applauds, and then "[MUSIC]" when the band starts playing.

It's not coincidence that you have no clue what you are talking about.

9th Jan 2025 21:33 #9
koberulz

Well the FasterWhisper pass with your suggested settings completed well ahead of its initial 24-hour estimate...and was just as bad as the first time. Most lyrics are absent entirely, and the few lines that did get transcribed are just absolute nonsense.

Weirdly, I looked at the second CPP pass - the one with the SDH subs - and it didn't subtitle any of the lyrics either. No idea why it did so well on the lyrics the first time around. Unless it wasn't CPP, but one of the other engines? But again, that was going to take 12 hours, and the GPU didn't seem to be doing anything, so I'm not sure what the story is there.

9th Jan 2025 22:05 #10
VoodooFX

Originally Posted by koberulz

...completed well ahead of its initial 24-hour estimate...and was just as bad as the first time.

SE's estimates can be wildly inaccurate; you would need to run the tools directly to see accurate ETAs.

If it didn't help, then most likely your audio is not good for transcription. What model did you use?
Cut and share a minute sample of that singing.

Originally Posted by koberulz

...I'm not sure what the story is there.

The Whisper model is non-deterministic by default; you can get completely different results on every run, especially on bad material.
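
As a rough illustration of how run-to-run variation can arise, here is a toy sketch in Python. It assumes one common cause in decoders of this kind: picking the next word by sampling at a non-zero temperature rather than always taking the top-scoring one. Whether that explains the two CPP passes in this thread is not established (they also used different models), and the scores below are invented.

```python
import math
import random
from typing import Dict


def pick_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick the next token: argmax at temperature 0, otherwise sample from a softmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: val / temperature for tok, val in logits.items()}
    peak = max(scaled.values())  # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in scaled.values()]
    return rng.choices(list(scaled.keys()), weights=weights, k=1)[0]


# Hypothetical scores for the next word of a hard-to-hear lyric.
logits = {"fancy": 1.2, "flancy": 1.0, "pansy": 0.9}

greedy = {pick_token(logits, 0.0, random.Random(seed)) for seed in range(20)}
sampled = [pick_token(logits, 1.0, random.Random(seed)) for seed in range(20)]
print(greedy)   # the same single token every time
print(sampled)  # can differ between runs when the scores are this close
```

The closer the candidate scores are, which is more likely on noisy or musical material, the more the sampled choice can change from one run to the next.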

9th Jan 2025 22:24 #11
koberulz

Originally Posted by VoodooFX

If it didn't help, then most likely your audio is not good for transcription.

Well, it worked once, perfectly, across the entire opening song. I didn't check further than that, and it appears I've accidentally deleted the results of that pass at some point. But it's not like it only picked up a line or two.

What model did you use?

Good point. I think the frozen CPP pass, which worked, was large-v3-turbo, and the other one, which didn't pick up the lyrics, was medium. I tried spinning up large-v3-turbo-q5, which is significantly smaller, via CPP and that's...estimating 24 hours again. The FasterWhisper pass was large-v3.

Sample attached.

Attached Files

sample.mkv (37.34 MB, 48 views)

10th Jan 2025 00:44 #12
VoodooFX

ff_mdx_kim2 doesn't work well on this audio, don't use it.

--model large-v2 -vad false:

Code:

[00:01.000 --> 00:03.400]  Boys, fire it up!
[00:13.740 --> 00:16.060]  Right, right, turn off the lights
[00:16.060 --> 00:18.100]  We're gonna lose our minds tonight
[00:22.180 --> 00:24.580]  I love this all too much
[00:24.580 --> 00:26.760]  5 AM turn the radio off
[00:26.760 --> 00:28.740]  There's no rock and roll
[00:30.000 --> 00:33.380]  We're gonna lose our minds tonight
[00:33.380 --> 00:36.540]  Money-crashin', fanny-snatchin'
[00:37.260 --> 00:40.700]  Call me up if you a gangsta
[00:40.700 --> 00:45.000]  Call me fancy, just get dancy
[00:45.520 --> 00:48.840]  And I will say yes

--model large-v2 --vad_method pyannote_v3:

Code:

[00:00.790 --> 00:03.380]  Boys, fire it up!
[00:13.900 --> 00:18.100]  Ride, ride, turn off the lights We're gonna lose our minds tonight
[00:22.460 --> 00:26.750]  I love this all too much 5 AM turn the radio on
[00:26.750 --> 00:29.350]  There's no rock and roll
[00:33.720 --> 00:40.560]  Snatch it, fanny snatch it Call me up if you a gangster
[00:41.560 --> 00:44.560]  Call me a pansy, just get dancy
[00:45.840 --> 00:47.980]  I'm so serious

Last edited by VoodooFX; 10th Jan 2025 at 00:50.


10th Jan 2025 01:24 #13
koberulz

Originally Posted by VoodooFX

ff_mdx_kim2 doesn't work well on this audio, don't use it.

I don't know what it is or what it's doing, how am I supposed to know whether it works well or not?

10th Jan 2025 02:45 #14
VoodooFX

Originally Posted by koberulz

I don't know what it is or what it's doing

It pre-processes the audio with a voice extraction model.

Originally Posted by koberulz

how am I supposed to know

I heard that reading the help is a good start to knowing things.

EDIT:

Instead of "ff_mdx_kim2" I tried BS-RoFormer (which is the current SOTA for vocal extraction, "mdx_kim2" was one of the best in 2023):

Code:

[00:00.780 --> 00:03.400] Boys, fire it up!
[00:13.630 --> 00:16.050] Right, right, turn off the lights
[00:16.050 --> 00:18.150] We're gonna lose our minds tonight
[00:22.510 --> 00:24.590] I love this all too much
[00:24.590 --> 00:26.830] 5 AM, turn the radio off
[00:26.830 --> 00:29.030] There's the rock'n'roll
[00:33.060 --> 00:36.260] Fanny Crasher, Fanny Snatcher
[00:37.140 --> 00:40.660] Call me up if you a gangsta
[00:41.560 --> 00:45.000] Don't be flancy, just get dancy
[00:45.660 --> 00:48.040] Why so serious?

Btw, Mel-RoFormer shows pretty good results too.
Last edited by VoodooFX; 10th Jan 2025 at 03:06.

10th Jan 2025 03:35 #15
koberulz

Originally Posted by VoodooFX

It pre-processes the audio with a voice extraction model.

I have no idea what that means.

I heard that reading the help is a good start to knowing things.

I don't even know what I'm supposed to be reading!

Instead of "ff_mdx_kim2" I tried BS-RoFormer (which is the current SOTA for vocal extraction, "mdx_kim2" was one of the best in 2023)

Do I need to install something else here? The Advanced screen in Subtitle Edit doesn't seem to mention this one.

10th Jan 2025 03:47 #16
VoodooFX

Originally Posted by koberulz

I have no idea what that means.

C'est la vie.

Originally Posted by koberulz

Do I need to install something else here?

It's not implemented in Faster-Whisper-XXL.

10th Jan 2025 04:03 #17
pcspeak

Member

Join Date
Apr 2007

Location
Australia
This is using Faster-Whisper-XXL r239.1
https://github.com/Purfview/whisper-standalone-win/releases/tag/Faster-Whisper-XXL
This is the command line script I used.

Code:

"D:\Whisper-XXL\faster-whisper-xxl.exe" "D:\aa\sample.mkv" --model large-v3-turbo --vad_filter false --sentence --verbose true -o source

The result looks good. Yes/No? 11.5 seconds. faster-whisper chose to use compute type: int8_float32
I don't have a particularly powerful video card (see the "GPU #0" line in the log below).

Standalone Faster-Whisper-XXL r239.1 running on: CUDA
Number of visible GPU devices: 1
Supported compute types by GPU: {'int8_float16', 'int8', 'float16', 'int8_float32', 'float32'}

Note: 'large-v3' model may produce worse results than 'large-v2'!

[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true, AVX512=false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Selected ISA: AVX2
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Use Intel MKL: true
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - SGEMM backend: MKL (packed: false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S16 backend: MKL (packed: false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S8 backend: MKL (packed: false, u8s8 preferred: true)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] GPU #0: NVIDIA GeForce GTX 1650 SUPER (CC=7.5)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow INT8: true
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow FP16: true (with Tensor Cores: true)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow BF16: false
[2025-01-10 19:42:11.958] [ctranslate2] [thread 8056] [info] Using CUDA allocator: cuda_malloc_async
[2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] Loaded model D:\Whisper-XXL\_models\faster-whisper-large-v3-turbo on device cuda:0
[2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] - Binary version: 6
[2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Model specification revision: 3
[2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Selected compute type: int8_float32

Faster-Whisper's large-v3-turbo model loaded in: 4.14 seconds

Starting sequential inference to transcribe: d:\aa\sample.mkv

Processing audio with duration 01:00.000

Detecting language using up to the first 30 seconds. Use `--language` to specify the language.
[2025-01-10 19:42:13.225] [ctranslate2] [thread 12648] [info] Loaded cuBLAS library version 12.1.3
Detected language 'English' with probability 0.97

Processing segment at 00:00.000
[00:00.000 --> 00:03.360] Boys, fire it up!
[00:13.640 --> 00:16.040] Ride, ride, turn off the lights
[00:16.040 --> 00:18.060] We gonna lose our minds tonight
[00:22.140 --> 00:24.560] I love when it's all too much
[00:24.560 --> 00:26.720] 5am turn the radio up
[00:26.720 --> 00:28.740] It's still rock and roll
Processing segment at 00:30.000
[00:32.280 --> 00:34.000] You won't be crashing
[00:34.540 --> 00:36.460] Fanny snatchin'
[00:37.500 --> 00:38.800] Call me up
[00:38.800 --> 00:40.580] If you were a gangster
[00:40.580 --> 00:43.120] You won't be fancy
[00:43.120 --> 00:44.860] Just get dizzy
[00:44.860 --> 00:48.240] I said serious
[00:56.280 --> 00:58.380] So raise your glass
Processing segment at 00:58.380
[00:58.380 --> 00:59.440] Take you home

Transcription speed: 15.97 audio seconds/s
Subtitles are written to 'd:\aa' directory.

Operation finished in: 0:00:09.668
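
Since SE's own estimates were called unreliable earlier in the thread, the speed figure printed by the tool gives a rough way to estimate a full run. The sketch below, in Python, parses the "Transcription speed" line from the log above and scales it to a film length. The 100-minute length is a placeholder, not a figure from this thread, and the calculation assumes the speed stays constant, which may not hold on another GPU or with other settings.

```python
import re
from datetime import timedelta

LOG_LINE = "Transcription speed: 15.97 audio seconds/s"  # copied from the log above


def parse_speed(line: str) -> float:
    """Extract the 'audio seconds per second' figure from a speed line."""
    match = re.search(r"Transcription speed:\s*([\d.]+)\s*audio seconds/s", line)
    if match is None:
        raise ValueError(f"no transcription speed found in: {line!r}")
    return float(match.group(1))


def estimate_runtime(audio_seconds: float, speed: float) -> timedelta:
    """Estimated wall-clock time, assuming the speed stays constant."""
    return timedelta(seconds=audio_seconds / speed)


speed = parse_speed(LOG_LINE)
film_seconds = 100 * 60  # hypothetical 100-minute concert film
print(estimate_runtime(film_seconds, speed))
```

At that speed a film of this length would take a little over six minutes, so a multi-hour estimate suggests the run is not using the GPU the way this one did.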
10th Jan 2025 04:10 #18
VoodooFX

For my tests I cut the audio after "Why so serious?".

Btw, the real lyrics for reference:

Code:

Boys, fire it up!
Right, right, turn off the lights
We gonna lose our minds tonight
I love when it's all too much
5 AM turn the radio up
Where's the rock and roll?
Party crasher, panty snatcher
Call me up if you a gangsta
Don't be fancy, just get dancy
Why so serious?
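
With reference lyrics available, the different passes can be compared with a number instead of by eye. Below is a minimal sketch in Python of a word error rate (word-level edit distance divided by the reference length), applied to the last two lyric lines as they appear in the outputs posted above. The normalisation (lowercasing, dropping punctuation) is a choice made for this example, and the labels are only shorthand for the runs in this thread.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters, digits and apostrophes as words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


reference = "Don't be fancy, just get dancy. Why so serious?"
candidates = {
    "large-v2, no VAD": "Call me fancy, just get dancy. And I will say yes",
    "large-v2, pyannote_v3": "Call me a pansy, just get dancy. I'm so serious",
    "BS-RoFormer pre-pass": "Don't be flancy, just get dancy. Why so serious?",
}
for name, text in candidates.items():
    print(f"{name}: {word_error_rate(reference, text):.2f}")
```

Lower is better. Two lines are too small a sample to rank the methods, so the full minute would need to be scored for a fair comparison.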
10th Jan 2025 04:34 #19
koberulz

Originally Posted by VoodooFX

It's not implemented in Faster-Whisper-XXL.

So how am I supposed to use it then?

10th Jan 2025 04:36 #20
VoodooFX

Originally Posted by koberulz

So how am I supposed to use it then?

https://github.com/lucidrains/BS-RoFormer

10th Jan 2025 06:11 #21
koberulz

Originally Posted by VoodooFX

Originally Posted by koberulz

So how am I supposed to use it then?

https://github.com/lucidrains/BS-RoFormer

I'm not a programmer, I have no idea what anything on that page means.

11th Jan 2025 23:12 #22
loninappleton

Member

Join Date
Jun 2005

Location
USA
Originally Posted by VoodooFX

These are command-line parameters. In the same window as the "Generate" button there is an "Advanced" button; you add the parameters there.

Regarding adding parameters under "Advanced" in SE, could you show a sample screenshot so I can see the format?

I would be interested to know if I can get some code from a post here and paste it into SE as is.

12th Jan 2025 10:12 #23
VoodooFX

Just add them separated by spaces.
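
To show the format, here is a tiny sketch in Python that joins the parameters posted earlier in this thread into the single line of text that would go in the "Advanced" field. The helper name is made up for this example, and which parameters to include depends on the audio.

```python
from typing import List


def advanced_field(parameters: List[str]) -> str:
    """Join command-line parameters with single spaces into one line of text."""
    return " ".join(p.strip() for p in parameters if p.strip())


# Parameters exactly as posted earlier in this thread.
params = ["--vad_method pyannote_v3", "--ff_mdx_kim2"]
print(advanced_field(params))
```

The printed line can be pasted as is; each option keeps its own value next to it, and options are separated from each other by one space.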

13th Jan 2025 02:26 #24
loninappleton

Thanks.

I'll try what is posted here, since I have nowhere else to start.
