VideoHelp Forum




  1. I tried FasterWhisper XXL, but it didn't work very well. CPP works much better (it's a concert film and FWXXL skips over most of the singing but CPP is accurately transcribing lyrics), but it's taking about 6x the video length. My GPU, meanwhile, is sitting on 5%, according to Task Manager's Performance tab, which indicates that it's not doing much of anything.

    So is there a setting I'm missing to make it use the GPU? Also, is there a setting to allow it to create SDH subs? My understanding is that Whisper is capable of doing this, but there doesn't seem to be anything within Subtitle Edit for that.

    I'm running Win10 with an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz and NVIDIA GeForce GTX 1060 6GB.
  2. Video Damager VoodooFX's Avatar
    Join Date
    Oct 2021
    Location
    At Doom9
    Search PM
    Originally Posted by koberulz View Post
    I tried FasterWhisper XXL, but it didn't work very well.
    Because you didn't set the proper settings for your audio. For the concerts you want to use better VAD:
    Code:
    --vad_method pyannote_v3
    and background noise removal:
    Code:
    --ff_mdx_kim2
    Not sure that by default 6GB will be enough for mdx_kim2, report how it went.
  3. Originally Posted by VoodooFX View Post
    Originally Posted by koberulz View Post
    I tried FasterWhisper XXL, but it didn't work very well.
    Because you didn't set the proper settings for your audio. For the concerts you want to use better VAD:
    Code:
    --vad_method pyannote_v3
    and background noise removal:
    Code:
    --ff_mdx_kim2
    Not sure that by default 6GB will be enough for mdx_kim2, report how it went.
    I have no idea how I'm supposed to even set these? Or what they mean? I just pulled up the dialog in SE and clicked "generate".
  4. Video Damager VoodooFX's Avatar
    These are parameters. In the same window as the "Generate" button there is an "Advanced" button; you add the parameters there.
  5. What's VAD? What do these actually do?

    I'm more confused on the SDH thing, because I canceled the CPP process that was running and it actually had SDH in it, which the previous CPP pass (which froze) didn't.
  6. Video Damager VoodooFX's Avatar
    Originally Posted by koberulz View Post
    What's VAD?
    It's voice activity detection: detected speech is passed to Whisper for transcription, and areas where no voice is detected are skipped.
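
    To make the gating idea concrete, here is a toy sketch in Python. It is an assumption-laden illustration only: it uses a simple energy threshold on a synthetic signal, which is not how pyannote_v3 or any other real VAD model works internally. The function names, frame size and threshold are all made up for the example.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_segments(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Return (start_s, end_s) spans whose frame energy reaches the threshold.

    Toy stand-in for a VAD: a real one classifies speech, not loudness.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments, start = [], None
    for idx, energy in enumerate(energies):
        t = idx * frame_len / sample_rate
        if energy >= threshold and start is None:
            start = t                      # a "voiced" region begins
        elif energy < threshold and start is not None:
            segments.append((start, t))    # region ends, keep it
            start = None
    if start is not None:                  # signal ended while still "voiced"
        segments.append((start, len(energies) * frame_len / sample_rate))
    return segments

# Synthetic signal: 1 s of silence, 1 s of a 220 Hz tone, 1 s of silence.
RATE = 8000
signal = (
    [0.0] * RATE
    + [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE)]
    + [0.0] * RATE
)
segments = detect_segments(signal, RATE)
print(segments)  # only these spans would be handed to the transcriber
```

    Only the returned spans would go on to transcription; everything else is dropped. That is also why a detector tuned for speech can throw away singing over loud music.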


    Originally Posted by koberulz View Post
    I'm more confused on the SDH thing...
    The original Whisper models are not meant to produce SDH subtitles; what you are observing are so-called hallucinations.
  7. Originally Posted by VoodooFX View Post
    The original Whisper models are not meant to produce SDH subtitles; what you are observing are so-called hallucinations.
    Ah, yes, it's mere coincidence that it keeps hallucinating the subtitle "[APPLAUSE]" every time the crowd applauds, and then "[MUSIC]" when the band starts playing.

    Looks like FasterWhisper with your suggested settings is going to take about 24 hours to run, which is a bit more than the 10 hours it was going to take CPP.
  8. Video Damager VoodooFX's Avatar
    Originally Posted by koberulz View Post
    Ah, yes, it's mere coincidence that it keeps hallucinating the subtitle "[APPLAUSE]" every time the crowd applauds, and then "[MUSIC]" when the band starts playing.
    It's not coincidence that you have no clue what you are talking about.
  9. Well the FasterWhisper pass with your suggested settings completed well ahead of its initial 24-hour estimate...and was just as bad as the first time. Most lyrics are absent entirely, and the few lines that did get transcribed are just absolute nonsense.

    Weirdly, I looked at the second CPP pass - the one with the SDH subs - and it didn't subtitle any of the lyrics either. No idea why it did so well on the lyrics the first time around. Unless it wasn't CPP, but one of the other engines? But again, that was going to take 12 hours, and the GPU didn't seem to be doing anything, so I'm not sure what the story is there.
  10. Video Damager VoodooFX's Avatar
    Originally Posted by koberulz View Post
    ...completed well ahead of its initial 24-hour estimate...and was just as bad as the first time.
    SE's estimates can be wildly inaccurate; you would need to run the tools directly to see accurate ETAs.

    If it didn't help, then most likely your audio is not well suited to transcription. What model did you use?
    Cut and share a one-minute sample of that singing.

    Originally Posted by koberulz View Post
    ...I'm not sure what the story is there.
    Whisper models are non-deterministic by default; you can get completely different results on every run, especially on bad material.
  11. Originally Posted by VoodooFX View Post
    If it didn't help, then most likely your audio is not well suited to transcription.
    Well, it worked once, perfectly, across the entire opening song. I didn't check further than that, and it appears I've accidentally deleted the results of that pass at some point. But it's not like it only picked up a line or two.

    What model did you use?
    Good point. I think the frozen CPP pass, which worked, was large-v3-turbo, and the other one, which didn't pick up the lyrics, was medium. I tried spinning up large-v3-turbo-q5, which is significantly smaller, via CPP and that's...estimating 24 hours again. The FasterWhisper pass was large-v3.

    Sample attached.
  12. Video Damager VoodooFX's Avatar
    ff_mdx_kim2 doesn't work well on this audio, don't use it.

    --model large-v2 -vad false:
    Code:
    [00:01.000 --> 00:03.400]  Boys, fire it up!
    [00:13.740 --> 00:16.060]  Right, right, turn off the lights
    [00:16.060 --> 00:18.100]  We're gonna lose our minds tonight
    [00:22.180 --> 00:24.580]  I love this all too much
    [00:24.580 --> 00:26.760]  5 AM turn the radio off
    [00:26.760 --> 00:28.740]  There's no rock and roll
    [00:30.000 --> 00:33.380]  We're gonna lose our minds tonight
    [00:33.380 --> 00:36.540]  Money-crashin', fanny-snatchin'
    [00:37.260 --> 00:40.700]  Call me up if you a gangsta
    [00:40.700 --> 00:45.000]  Call me fancy, just get dancy
    [00:45.520 --> 00:48.840]  And I will say yes
    --model large-v2 --vad_method pyannote_v3:
    Code:
    [00:00.790 --> 00:03.380]  Boys, fire it up!
    [00:13.900 --> 00:18.100]  Ride, ride, turn off the lights We're gonna lose our minds tonight
    [00:22.460 --> 00:26.750]  I love this all too much 5 AM turn the radio on
    [00:26.750 --> 00:29.350]  There's no rock and roll
    [00:33.720 --> 00:40.560]  Snatch it, fanny snatch it Call me up if you a gangster
    [00:41.560 --> 00:44.560]  Call me a pansy, just get dancy
    [00:45.840 --> 00:47.980]  I'm so serious
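
    The bracketed listings above are console-style output, not a subtitle file. As a rough sketch (an illustration, not something any of the tools in this thread are confirmed to do this way), such lines could be turned into SRT cues with a few lines of Python. The function names are made up, and the regular expression assumes the `[start --> end]  text` layout shown above.

```python
import re

# Matches "[00:01.000 --> 00:03.400]  Boys, fire it up!"
LINE_RE = re.compile(r"\[\s*([\d:.]+)\s*-->\s*([\d:.]+)\s*\]\s*(.*)")

def to_srt_time(stamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' to SRT's 'HH:MM:SS,mmm'."""
    parts = stamp.split(":")
    if len(parts) == 2:          # no hours field in short clips
        parts = ["00"] + parts
    hours, minutes, seconds = parts
    whole, _, millis = seconds.partition(".")
    millis = millis.ljust(3, "0")[:3]
    return f"{int(hours):02d}:{int(minutes):02d}:{int(whole):02d},{millis}"

def console_to_srt(text):
    """Build SRT text from bracketed console lines; other lines are ignored."""
    cues = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        start, end, caption = match.groups()
        cues.append(
            f"{len(cues) + 1}\n"
            f"{to_srt_time(start)} --> {to_srt_time(end)}\n"
            f"{caption.strip()}\n"
        )
    return "\n".join(cues)       # blank line between cues

sample = """[00:01.000 --> 00:03.400]  Boys, fire it up!
[00:13.740 --> 00:16.060]  Right, right, turn off the lights"""
srt_text = console_to_srt(sample)
print(srt_text)
```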
  13. Originally Posted by VoodooFX View Post
    ff_mdx_kim2 doesn't work well on this audio, don't use it.
    I don't know what it is or what it's doing, how am I supposed to know whether it works well or not?
  14. Video Damager VoodooFX's Avatar
    Originally Posted by koberulz View Post
    I don't know what it is or what it's doing
    It pre-processes the audio with a vocal extraction model.

    Originally Posted by koberulz View Post
    how am I supposed to know
    I heard that reading the help is a good way to start knowing things.


    EDIT:

    Instead of "ff_mdx_kim2" I tried BS-RoFormer (which is the current SOTA for vocal extraction, "mdx_kim2" was one of the best in 2023):
    Code:
    [00:00.780 --> 00:03.400]  Boys, fire it up!
    [00:13.630 --> 00:16.050]  Right, right, turn off the lights
    [00:16.050 --> 00:18.150]  We're gonna lose our minds tonight
    [00:22.510 --> 00:24.590]  I love this all too much
    [00:24.590 --> 00:26.830]  5 AM, turn the radio off
    [00:26.830 --> 00:29.030]  There's the rock'n'roll
    [00:33.060 --> 00:36.260]  Fanny Crasher, Fanny Snatcher
    [00:37.140 --> 00:40.660]  Call me up if you a gangsta
    [00:41.560 --> 00:45.000]  Don't be flancy, just get dancy
    [00:45.660 --> 00:48.040]  Why so serious?
    Btw, Mel-RoFormer shows pretty good results too.
  15. Originally Posted by VoodooFX View Post
    It pre-processes the audio with a vocal extraction model.
    I have no idea what that means.

    I heard that reading the help is a good way to start knowing things.
    I don't even know what I'm supposed to be reading!


    Instead of "ff_mdx_kim2" I tried BS-RoFormer (which is the current SOTA for vocal extraction, "mdx_kim2" was one of the best in 2023)
    Do I need to install something else here? The Advanced screen in Subtitle Edit doesn't seem to mention this one.
  16. Video Damager VoodooFX's Avatar
    Originally Posted by koberulz View Post
    I have no idea what that means.
    C'est la vie.

    Originally Posted by koberulz View Post
    Do I need to install something else here?
    It's not implemented in Faster-Whisper-XXL.
  17. Member
    Join Date
    Apr 2007
    Location
    Australia
    Search Comp PM
    This is using Faster-Whisper-XXL r239.1
    https://github.com/Purfview/whisper-standalone-win/releases/tag/Faster-Whisper-XXL
    This is the command line script I used.
    Code:
    "D:\Whisper-XXL\faster-whisper-xxl.exe" "D:\aa\sample.mkv" --model large-v3-turbo --vad_filter false --sentence --verbose true -o source
    The result looks good, yes or no? It took 11.5 seconds, and faster-whisper chose compute type int8_float32.
    I don't have a particularly powerful video card (see the "GPU #0" line in the log below).
    Standalone Faster-Whisper-XXL r239.1 running on: CUDA
    Number of visible GPU devices: 1
    Supported compute types by GPU: {'int8_float16', 'int8', 'float16', 'int8_float32', 'float32'}

    Note: 'large-v3' model may produce worse results than 'large-v2'!

    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true, AVX512=false)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Selected ISA: AVX2
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Use Intel MKL: true
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - SGEMM backend: MKL (packed: false)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S16 backend: MKL (packed: false)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - GEMM_S8 backend: MKL (packed: false, u8s8 preferred: true)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] GPU #0: NVIDIA GeForce GTX 1650 SUPER (CC=7.5)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow INT8: true
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow FP16: true (with Tensor Cores: true)
    [2025-01-10 19:42:08.138] [ctranslate2] [thread 8056] [info] - Allow BF16: false
    [2025-01-10 19:42:11.958] [ctranslate2] [thread 8056] [info] Using CUDA allocator: cuda_malloc_async
    [2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] Loaded model D:\Whisper-XXL\_models\faster-whisper-large-v3-turbo on device cuda:0
    [2025-01-10 19:42:12.186] [ctranslate2] [thread 8056] [info] - Binary version: 6
    [2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Model specification revision: 3
    [2025-01-10 19:42:12.187] [ctranslate2] [thread 8056] [info] - Selected compute type: int8_float32

    Faster-Whisper's large-v3-turbo model loaded in: 4.14 seconds

    Starting sequential inference to transcribe: d:\aa\sample.mkv

    Processing audio with duration 01:00.000

    Detecting language using up to the first 30 seconds. Use `--language` to specify the language.
    [2025-01-10 19:42:13.225] [ctranslate2] [thread 12648] [info] Loaded cuBLAS library version 12.1.3
    Detected language 'English' with probability 0.97

    Processing segment at 00:00.000
    [00:00.000 --> 00:03.360] Boys, fire it up!
    [00:13.640 --> 00:16.040] Ride, ride, turn off the lights
    [00:16.040 --> 00:18.060] We gonna lose our minds tonight
    [00:22.140 --> 00:24.560] I love when it's all too much
    [00:24.560 --> 00:26.720] 5am turn the radio up
    [00:26.720 --> 00:28.740] It's still rock and roll
    Processing segment at 00:30.000
    [00:32.280 --> 00:34.000] You won't be crashing
    [00:34.540 --> 00:36.460] Fanny snatchin'
    [00:37.500 --> 00:38.800] Call me up
    [00:38.800 --> 00:40.580] If you were a gangster
    [00:40.580 --> 00:43.120] You won't be fancy
    [00:43.120 --> 00:44.860] Just get dizzy
    [00:44.860 --> 00:48.240] I said serious
    [00:56.280 --> 00:58.380] So raise your glass
    Processing segment at 00:58.380
    [00:58.380 --> 00:59.440] Take you home

    Transcription speed: 15.97 audio seconds/s
    Subtitles are written to 'd:\aa' directory.

    Operation finished in: 0:00:09.668
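
    For anyone wanting to turn the "Transcription speed" figure in that log into a rough time estimate, here is a small Python sketch. It assumes the speed scales linearly with file length and ignores model loading, which may not hold in practice. The 2-hour duration is a hypothetical example, not the length of the concert film discussed here, and the function names are made up.

```python
def estimate_wall_seconds(audio_seconds, speed_audio_s_per_s):
    """Wall-clock seconds needed at a given speed (audio seconds per second)."""
    if speed_audio_s_per_s <= 0:
        raise ValueError("speed must be positive")
    return audio_seconds / speed_audio_s_per_s

def fmt_hms(seconds):
    """Format a duration in seconds as H:MM:SS, rounded to the nearest second."""
    total = round(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

REPORTED_SPEED = 15.97   # "Transcription speed" from the log above
SIX_TIMES_SLOWER = 1 / 6  # "about 6x the video length" from the first post

# The 60-second sample at the reported GPU speed (transcription only).
sample_estimate = estimate_wall_seconds(60, REPORTED_SPEED)

# Hypothetical 2-hour film at both speeds, assuming linear scaling.
film_fast = estimate_wall_seconds(2 * 3600, REPORTED_SPEED)
film_slow = estimate_wall_seconds(2 * 3600, SIX_TIMES_SLOWER)

print(f"sample: {sample_estimate:.2f} s")
print(f"2 h film at reported speed: {fmt_hms(film_fast)}")
print(f"2 h film at 6x video length: {fmt_hms(film_slow)}")
```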
  18. Video Damager VoodooFX's Avatar
    For my tests I cut the audio after "Why so serious?".

    Btw, here are the real lyrics for reference:

    Code:
    Boys, fire it up!
    Right, right, turn off the lights
    We gonna lose our minds tonight
    
    I love when it's all too much
    5 AM turn the radio up
    Where's the rock and roll?
    
    Party crasher, panty snatcher
    Call me up if you a gangsta
    Don't be fancy, just get dancy
    Why so serious?
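
    With reference lyrics available, the different outputs in this thread can be compared by number instead of by eye. Below is a minimal Python sketch of a word error rate (word-level edit distance divided by reference length). The normalisation (lowercasing, dropping punctuation) is an assumption chosen for this example, and the function names are made up. The hypothesis text is the BS-RoFormer result posted earlier in the thread.

```python
import re

def words(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))       # edit distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,               # reference word missing (deletion)
                curr[j - 1] + 1,           # extra hypothesis word (insertion)
                prev[j - 1] + cost,        # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

REFERENCE = """Boys, fire it up!
Right, right, turn off the lights
We gonna lose our minds tonight
I love when it's all too much
5 AM turn the radio up
Where's the rock and roll?
Party crasher, panty snatcher
Call me up if you a gangsta
Don't be fancy, just get dancy
Why so serious?"""

BS_ROFORMER = """Boys, fire it up!
Right, right, turn off the lights
We're gonna lose our minds tonight
I love this all too much
5 AM, turn the radio off
There's the rock'n'roll
Fanny Crasher, Fanny Snatcher
Call me up if you a gangsta
Don't be flancy, just get dancy
Why so serious?"""

print(f"WER: {word_error_rate(REFERENCE, BS_ROFORMER):.2%}")
```

    A lower value means fewer word-level mistakes. Note that it says nothing about timing accuracy.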
  19. Originally Posted by VoodooFX View Post
    It's not implemented in Faster-Whisper-XXL.
    So how am I supposed to use it then?
  20. Originally Posted by VoodooFX View Post
    Originally Posted by koberulz View Post
    So how am I supposed to use it then?
    https://github.com/lucidrains/BS-RoFormer
    I'm not a programmer, I have no idea what anything on that page means.
  21. Originally Posted by VoodooFX View Post
    These are parameters. In the same window as the "Generate" button there is an "Advanced" button; you add the parameters there.
    Regarding adding parameters via "Advanced" in SE, could you show the code in a sample screenshot so I can see the format?

    I would be interested to know if I can get some code from a post here and paste it into SE as is.
  22. Thanks.

    I'll try what is posted here, since I have nowhere else to start.