I tried FasterWhisper XXL, but it didn't work very well. CPP works much better (it's a concert film and FWXXL skips over most of the singing but CPP is accurately transcribing lyrics), but it's taking about 6x the video length. My GPU, meanwhile, is sitting on 5%, according to Task Manager's Performance tab, which indicates that it's not doing much of anything.
So is there a setting I'm missing to make it use the GPU? Also, is there a setting to allow it to create SDH subs? My understanding is that Whisper is capable of doing this, but there doesn't seem to be anything within Subtitle Edit for that.
I'm running Win10 with an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz and NVIDIA GeForce GTX 1060 6GB.
-
These are parameters. In the same window as the "Generate" button there is an "Advanced" button; add the parameters there.
-
What's VAD? What do these actually do?
I'm more confused on the SDH thing, because I canceled the CPP process that was running and it actually had SDH in it, which the previous CPP pass (which froze) didn't.
-
Ah, yes, it's mere coincidence that it keeps hallucinating the subtitle "[APPLAUSE]" every time the crowd applauds, and then "[MUSIC]" when the band starts playing.
Looks like FasterWhisper with your suggested settings is going to take about 24 hours to run, which is a bit more than the 10 hours it was going to take CPP.
-
Well the FasterWhisper pass with your suggested settings completed well ahead of its initial 24-hour estimate...and was just as bad as the first time. Most lyrics are absent entirely, and the few lines that did get transcribed are just absolute nonsense.
Weirdly, I looked at the second CPP pass - the one with the SDH subs - and it didn't subtitle any of the lyrics either. No idea why it did so well on the lyrics the first time around. Unless it wasn't CPP, but one of the other engines? But again, that was going to take 12 hours, and the GPU didn't seem to be doing anything, so I'm not sure what the story is there.
-
SE estimates can be wildly inaccurate; you would need to run the tools directly to see accurate ETAs.
If it didn't help, then most likely your audio is not good for transcription. What model did you use?
Cut and share a one-minute sample of that singing.
The Whisper model is non-deterministic by default; you can get completely different results on every run, especially on bad material.
-
Well, it worked once, perfectly, across the entire opening song. I didn't check further than that, and it appears I've accidentally deleted the results of that pass at some point. But it's not like it only picked up a line or two.
What model did you use?
Sample attached.
-
ff_mdx_kim2 doesn't work well on this audio, don't use it.
--model large-v2 -vad false:
Code:
[00:01.000 --> 00:03.400] Boys, fire it up!
[00:13.740 --> 00:16.060] Right, right, turn off the lights
[00:16.060 --> 00:18.100] We're gonna lose our minds tonight
[00:22.180 --> 00:24.580] I love this all too much
[00:24.580 --> 00:26.760] 5 AM turn the radio off
[00:26.760 --> 00:28.740] There's no rock and roll
[00:30.000 --> 00:33.380] We're gonna lose our minds tonight
[00:33.380 --> 00:36.540] Money-crashin', fanny-snatchin'
[00:37.260 --> 00:40.700] Call me up if you a gangsta
[00:40.700 --> 00:45.000] Call me fancy, just get dancy
[00:45.520 --> 00:48.840] And I will say yes
Code:
[00:00.790 --> 00:03.380] Boys, fire it up!
[00:13.900 --> 00:18.100] Ride, ride, turn off the lights We're gonna lose our minds tonight
[00:22.460 --> 00:26.750] I love this all too much 5 AM turn the radio on
[00:26.750 --> 00:29.350] There's no rock and roll
[00:33.720 --> 00:40.560] Snatch it, fanny snatch it Call me up if you a gangster
[00:41.560 --> 00:44.560] Call me a pansy, just get dancy
[00:45.840 --> 00:47.980] I'm so serious
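One way to spot where a pass may have skipped singing is to look for long stretches with no text between consecutive segments. The following is only a minimal sketch, assuming the `[MM:SS.mmm --> MM:SS.mmm] text` layout shown above; the helper names and the 3-second threshold are illustrative and not part of any Whisper tool.

```python
import re

# Matches lines such as "[00:13.900 --> 00:18.100] Ride, ride, turn off the lights".
# Assumes the MM:SS.mmm layout seen in this thread; longer files may add an hours field.
SEGMENT_RE = re.compile(r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s*(.*)")


def parse_segments(text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    segments = []
    for line in text.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not a segment
        m1, s1, m2, s2, words = match.groups()
        start = int(m1) * 60 + float(s1)
        end = int(m2) * 60 + float(s2)
        segments.append((start, end, words))
    return segments


def find_gaps(segments, min_gap=3.0):
    """Return (gap_start, gap_end) pairs with no text for at least min_gap seconds."""
    gaps = []
    for (_, prev_end, _), (next_start, _, _) in zip(segments, segments[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps


# First five segments of the second output above.
sample = """\
[00:00.790 --> 00:03.380] Boys, fire it up!
[00:13.900 --> 00:18.100] Ride, ride, turn off the lights We're gonna lose our minds tonight
[00:22.460 --> 00:26.750] I love this all too much 5 AM turn the radio on
[00:26.750 --> 00:29.350] There's no rock and roll
[00:33.720 --> 00:40.560] Snatch it, fanny snatch it Call me up if you a gangster
"""

for gap_start, gap_end in find_gaps(parse_segments(sample)):
    print(f"no text from {gap_start:.2f}s to {gap_end:.2f}s")
```

A gap is only a hint: it can just as well be an instrumental break, so each one still has to be checked against the audio.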
Last edited by VoodooFX; 10th Jan 2025 at 01:50.
-
-
It pre-processes the audio with a voice extraction model.
I heard that reading the help is a good start to knowing things.
EDIT:
Instead of "ff_mdx_kim2" I tried BS-RoFormer (which is the current SOTA for vocal extraction, "mdx_kim2" was one of the best in 2023):
Code:
[00:00.780 --> 00:03.400] Boys, fire it up!
[00:13.630 --> 00:16.050] Right, right, turn off the lights
[00:16.050 --> 00:18.150] We're gonna lose our minds tonight
[00:22.510 --> 00:24.590] I love this all too much
[00:24.590 --> 00:26.830] 5 AM, turn the radio off
[00:26.830 --> 00:29.030] There's the rock'n'roll
[00:33.060 --> 00:36.260] Fanny Crasher, Fanny Snatcher
[00:37.140 --> 00:40.660] Call me up if you a gangsta
[00:41.560 --> 00:45.000] Don't be flancy, just get dancy
[00:45.660 --> 00:48.040] Why so serious?
Last edited by VoodooFX; 10th Jan 2025 at 04:06.
-
I have no idea what that means.
-
This is using Faster-Whisper-XXL r239.1
https://github.com/Purfview/whisper-standalone-win/releases/tag/Faster-Whisper-XXL
This is the command line script I used.
Code:
"D:\Whisper-XXL\faster-whisper-xxl.exe" "D:\aa\sample.mkv" --model large-v3-turbo --vad_filter false --sentence --verbose true -o source
I don't have a particularly powerful video card (see the GPU line in the log below).
Standalone Faster-Whisper-XXL r239.1 running on: CUDA
Number of visible GPU devices: 1
Supported compute types by GPU: {'int8_float16', 'int8', 'float16', 'int8_float32', 'float32'}
Note: 'large-v3' model may produce worse results than 'large-v2'!
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true, AVX512=false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Selected ISA: AVX2
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Use Intel MKL: true
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - SGEMM backend: MKL (packed: false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S16 backend: MKL (packed: false)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S8 backend: MKL (packed: false, u8s8 preferred: true)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] GPU #0: NVIDIA GeForce GTX 1650 SUPER (CC=7.5)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow INT8: true
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow FP16: true (with Tensor Cores: true)
[2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow BF16: false
[2025-01-10 19:42:11.958] [ctranslate2] [thread 8056] [info] Using CUDA allocator: cuda_malloc_async
[2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] Loaded model D:\Whisper-XXL\_models\faster-whisper-large-v3-turbo on device cuda:0
[2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] - Binary version: 6
[2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Model specification revision: 3
[2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Selected compute type: int8_float32
Faster-Whisper's large-v3-turbo model loaded in: 4.14 seconds
Starting sequential inference to transcribe: d:\aa\sample.mkv
Processing audio with duration 01:00.000
Detecting language using up to the first 30 seconds. Use `--language` to specify the language.
[2025-01-10 19:42:13.225] [ctranslate2] [thread 12648] [info] Loaded cuBLAS library version 12.1.3
Detected language 'English' with probability 0.97
Processing segment at 00:00.000
[00:00.000 --> 00:03.360] Boys, fire it up!
[00:13.640 --> 00:16.040] Ride, ride, turn off the lights
[00:16.040 --> 00:18.060] We gonna lose our minds tonight
[00:22.140 --> 00:24.560] I love when it's all too much
[00:24.560 --> 00:26.720] 5am turn the radio up
[00:26.720 --> 00:28.740] It's still rock and roll
Processing segment at 00:30.000
[00:32.280 --> 00:34.000] You won't be crashing
[00:34.540 --> 00:36.460] Fanny snatchin'
[00:37.500 --> 00:38.800] Call me up
[00:38.800 --> 00:40.580] If you were a gangster
[00:40.580 --> 00:43.120] You won't be fancy
[00:43.120 --> 00:44.860] Just get dizzy
[00:44.860 --> 00:48.240] I said serious
[00:56.280 --> 00:58.380] So raise your glass
Processing segment at 00:58.380
[00:58.380 --> 00:59.440] Take you home
Transcription speed: 15.97 audio seconds/s
Subtitles are written to 'd:\aa' directory.
Operation finished in: 0:00:09.668
-
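The "Transcription speed" figure in the log above gives a rough way to estimate a full run without relying on Subtitle Edit's ETA. This is a back-of-the-envelope sketch only: it assumes the speed measured on the one-minute sample holds for the whole film, and the 2-hour film length is a made-up placeholder, since the thread never states the real running time.

```python
def estimate_seconds(audio_seconds, speed, load_seconds=0.0):
    """Estimated wall-clock seconds: audio length divided by speed, plus model load time."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed + load_seconds


def format_hms(seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


SPEED = 15.97        # "Transcription speed: 15.97 audio seconds/s" from the log
LOAD_SECONDS = 4.14  # "model loaded in: 4.14 seconds" from the log
FILM_SECONDS = 2 * 3600  # hypothetical film length, not taken from the thread

sample_estimate = estimate_seconds(60, SPEED, LOAD_SECONDS)
film_estimate = estimate_seconds(FILM_SECONDS, SPEED, LOAD_SECONDS)

print("1-minute sample:", format_hms(sample_estimate))
print("2-hour film (assumed length):", format_hms(film_estimate))
```

The sample estimate comes out a little below the logged "Operation finished in" time, since the log's total also covers steps this sketch ignores, so treat the film figure as a lower bound.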
For my tests I cut audio after "Why so serious?".
Btw, the real lyrics for the reference:
Code:
Boys, fire it up!
Right, right, turn off the lights
We gonna lose our minds tonight
I love when it's all too much
5 AM turn the radio up
Where's the rock and roll?
Party crasher, panty snatcher
Call me up if you a gangsta
Don't be fancy, just get dancy
Why so serious?
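With the reference lyrics in hand, the runs can be compared by word error rate (word-level edit distance divided by the number of reference words) instead of by eye. The sketch below is a crude illustration, not a scoring method anyone in the thread used: the normalization is an assumption (lowercase, punctuation dropped, apostrophes kept), so "5am" and "5 AM" or "rock'n'roll" and "rock and roll" still count as mismatches.

```python
import re


def normalize(text):
    """Lowercase the text and split it into words, keeping digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_errors(reference, hypothesis):
    """Return (edit_distance_in_words, reference_word_count)."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # word missing from hypothesis
                           cur[j - 1] + 1,       # extra word in hypothesis
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1], len(ref)


def wer(reference, hypothesis):
    """Word error rate; lower is better, 0.0 means identical after normalization."""
    errors, ref_len = word_errors(reference, hypothesis)
    if ref_len == 0:
        raise ValueError("reference is empty")
    return errors / ref_len


REFERENCE = (
    "Boys, fire it up! Right, right, turn off the lights "
    "We gonna lose our minds tonight I love when it's all too much "
    "5 AM turn the radio up Where's the rock and roll? "
    "Party crasher, panty snatcher Call me up if you a gangsta "
    "Don't be fancy, just get dancy Why so serious?"
)

# Text copied from two of the outputs posted earlier in the thread.
RUNS = {
    "large-v2, VAD off": (
        "Boys, fire it up! Right, right, turn off the lights "
        "We're gonna lose our minds tonight I love this all too much "
        "5 AM turn the radio off There's no rock and roll "
        "We're gonna lose our minds tonight Money-crashin', fanny-snatchin' "
        "Call me up if you a gangsta Call me fancy, just get dancy "
        "And I will say yes"
    ),
    "BS-RoFormer vocal extraction": (
        "Boys, fire it up! Right, right, turn off the lights "
        "We're gonna lose our minds tonight I love this all too much "
        "5 AM, turn the radio off There's the rock'n'roll "
        "Fanny Crasher, Fanny Snatcher Call me up if you a gangsta "
        "Don't be flancy, just get dancy Why so serious?"
    ),
}

for name, text in RUNS.items():
    print(f"{name}: WER = {wer(REFERENCE, text):.2f}")
```

On a one-minute sample a handful of words moves the score a lot, so this is more useful for ranking settings against each other than as an absolute measure.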
-
-
-
Just add them, separated by spaces.