VideoHelp Forum




  1. Member
    Join Date
    May 2008
    Location
    France
    Thanks for the merge python script, I'll try that.

    You can also merge with SE: tools -> join subtitles

    Just add the 2 subs and click "join".

    Then: tools > Sort > By start time

    Looking at your subs, I see what happens: the missing subs come from the merge, but that has a cost: in places you end up with both the Whisper-generated subs and the merged ones for the same dialog.

    Example starting at 47:42:507 for E02 (timestamp with my version):

    One sub with 2 lines:

    Code:
    I always believed that I
    had a future in this country.
    And then 2 subs with 2 lines and one line:

    Code:
    I always believed
    we had a future
    Code:
    in this country.
    That's a bit of a mess!
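
    A minimal sketch of how such duplicates could be spotted automatically after a join and a sort by start time. The cue tuples and the timings are made up for illustration (loosely modelled on the E02 example), and the function names are hypothetical, not part of SE or of any script in this thread:

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

def join_and_sort(a: List[Cue], b: List[Cue]) -> List[Cue]:
    """Concatenate two cue lists and order them by start time."""
    return sorted(a + b, key=lambda cue: cue[0])

def find_overlaps(cues: List[Cue]) -> List[Tuple[Cue, Cue]]:
    """Return (earlier, later) pairs whose time ranges overlap.

    Expects cues sorted by start time. Each cue is compared with the
    cue that has the latest end time seen so far.
    """
    pairs = []
    latest = None  # cue with the latest end time seen so far
    for cur in cues:
        if latest is not None and cur[0] < latest[1]:
            pairs.append((latest, cur))
        if latest is None or cur[1] > latest[1]:
            latest = cur
    return pairs

# Made-up timings around 47:42.5
whisper = [(2862.5, 2866.0, "I always believed that I\nhad a future in this country.")]
merged = [
    (2862.6, 2864.4, "I always believed\nwe had a future"),
    (2864.4, 2866.1, "in this country."),
]
for prev, cur in find_overlaps(join_and_sort(whisper, merged)):
    print(f"{prev[0]:.3f}-{prev[1]:.3f} overlaps {cur[0]:.3f}-{cur[1]:.3f}")
```

    Cues that merely touch (one ends exactly when the next starts) are not reported, which is probably what you want for back-to-back subs.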

    About syncing subs.

    Most US shows are designed to have ads. These ads are generally inserted at scene changes, when the video goes dark briefly.

    Two different versions of the same show (a web-dl and a FOX broadcast edited to remove ads, for example) will show a timing discrepancy at the scene changes.

    My process is:

    1) Use comskip, a tool to find ads, to detect scene changes to produce a VideoRedo .vprj file.

    I use:

    comskip82_010_donators\comskip.exe --threads=20 --videoredo --detectmethod=95 --verbose=0 "Bones S01-E01.mkv"

    Even if the ads have already been removed, comskip usually finds where they were.

    2) Open the VideoRedo project. VideoRedo will clearly show where the ads were. It will show many false positives, but pressing F6 to step from one marker to the next shows the image, and when it's dark, it's likely where the ads were.

    Example with an old FOX show where the ads were already removed:

    Image
    [Attachment 89434 - Click to enlarge]


    Use SE to adjust at the beginning using "Set start and offset the rest". Then navigate to the next scene change that you see with VideoRedo. Check whether the subs are in sync just before that timestamp (usually they are) and just after it. If they are not, sync them at this location with "Set start and offset the rest".

    That does not work with all shows (not with Kabul, for example, which never had embedded ads anyway), but in my experience it works with many of them.
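
    For anyone scripting this instead of clicking through SE, here is a rough sketch of what "Set start and offset the rest" amounts to, assuming the command does what its name suggests: the selected sub gets a new start time and every later sub moves by the same amount. The function name and the tuple layout are my own, not SE's:

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

def set_start_and_offset_rest(cues: List[Cue], index: int, new_start: float) -> List[Cue]:
    """Move cue `index` to `new_start` and shift every later cue by the same delta.

    Cues before `index` are left untouched, so earlier sync points stay valid.
    """
    delta = new_start - cues[index][0]
    shifted = [(s + delta, e + delta, t) for s, e, t in cues[index:]]
    return cues[:index] + shifted

cues = [(10.0, 12.0, "a"), (100.0, 102.0, "b"), (200.0, 203.0, "c")]
# Suppose the second sub should really start at 130.5 s after an ad break
resynced = set_start_and_offset_rest(cues, 1, 130.5)
print(resynced)
```

    Applied once per scene change where the drift appears, this reproduces the manual process: each call fixes everything from that point on and leaves the already synced part alone.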

    Not sure if the donator version of comskip is still available. Or if it's actually needed.

    VideoRedo is no longer sold and is hard to activate now, but I read somewhere that somebody holds the rights to it and can provide a legal way. You would need to do a search for that.

    If I get more information, I'll send you a PM.
  2. Example starting at 47:42:507 for E02 (timestamp with my version):

    One sub with 2 lines:

    Code:
    I always believed that I
    had a future in this country.
    And then 2 subs with 2 lines and one line:

    Code:
    I always believed
    we had a future
    Code:
    in this country.
    SE engine is the culprit. No problems with CMD>CLI

    Image
    [Attachment 89439 - Click to enlarge]


    Doing the transcription again. Will update you with new subs soon.

    Meanwhile, read my PM
  3. Member
    Join Date
    May 2008
    Location
    France
    Originally Posted by sam12345 View Post
    Uploaded complete clean subs of Kabul (Mini TV Series) 2025


    https://www.opensubtitles.org
    Great, that will save me the trouble of doing all the subs myself.

    I actually tried to upload the Assembly ones, they were rejected because it's too obvious that they are AI generated.

    Edit:

    But there are still subs that overlap:

    Image
    [Attachment 89446 - Click to enlarge]
    Last edited by robena; 29th Oct 2025 at 12:53.
  4. Will update the code of Assembly tomorrow.
  5. Member
    Join Date
    May 2008
    Location
    France
    Hi,

    So, here is a perfect way for this Kabul series.

    This uses my way to number episodes: "Kabul S01-E01.mkv"

    This will likely fail with another convention such as "Kabul S01E01.mkv".

    1) Use Faster-Whisper-XXL_r245.1_windows with these options (taken from a REXX script, easy to port to something else):


    Code:
                                                                                                                   
    /* Just in case you want something else than English */
    if lang = 'en' then do
        task = ' --task translate'
        prompt = ' --initial_prompt "Translate everything to English."'
    end
    else do
        task = ' --task transcribe'
        prompt = ' --initial_prompt "Transcribe in 'lang'."'
    end
                                                                                                                   
    '--model large-v3' ,
    task ,
    ' --language 'lang ,
    prompt ,
    ' --device cuda' ,
    ' --compute_type float16' ,
    ' --batch_size 8' ,
    ' --vad_method pyannote_onnx_v3' ,
    ' --vad_device cuda' ,
    ' --beep_off',
    ' --vad_threshold 0.1' ,                 /* ULTRA LOW */
    ' --vad_min_speech_duration_ms 50' ,     /* 50ms = catch whispers */
    ' --vad_min_silence_duration_ms 100' ,   /* tighter gaps */
    ' --hallucination_silence_threshold 0.6' ,
    ' --no_speech_threshold 0.1' ,           /* catch ANY speech */
    ' --logprob_threshold -2.0' ,            /* keep low-conf */
    ' --compression_ratio_threshold 2.4' ,
    ' --beam_size 5' ,
    ' --best_of 5' ,
    ' --temperature 0' ,
    ' --repetition_penalty 1.1' ,
    ' --no_repeat_ngram_size 3' ,
    ' --condition_on_previous_text False' ,
    ' --word_timestamps True' ,
    ' --output_format all' ,                 /* JSON + SRT */
    ' --output_dir "'fdd(file)'"'
    That outputs a file such as "Kabul S01-E01.srt" that my REXX script renames to "Kabul S01-E01-en.srt"

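    *** Having a filename ending with "-something" is necessary: the merge script below writes its result to "Kabul S01-E01.srt" and skips the merge if that file already exists.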
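
    If you don't use REXX, the same idea can be sketched in Python. This is only an outline under assumptions: the executable name is hypothetical, passing the media file as the first argument is my assumption (the REXX fragment above does not show it), and only a subset of the options listed above is included. The option names themselves are copied from the REXX code:

```python
from pathlib import Path
from typing import List

# Hypothetical executable name; point it at your Faster-Whisper-XXL install.
WHISPER_EXE = "faster-whisper-xxl.exe"

def build_whisper_args(media: Path, lang: str = "en") -> List[str]:
    """Build an argument list mirroring a subset of the REXX options above."""
    if lang == "en":
        task, prompt = "translate", "Translate everything to English."
    else:
        task, prompt = "transcribe", f"Transcribe in {lang}."
    return [
        WHISPER_EXE, str(media),          # media path position is an assumption
        "--model", "large-v3",
        "--task", task,
        "--language", lang,
        "--initial_prompt", prompt,
        "--word_timestamps", "True",
        "--output_format", "all",
        "--output_dir", str(media.parent),
    ]

def tagged_srt_path(media: Path, lang: str = "en") -> Path:
    """Return e.g. 'Kabul S01-E01-en.srt' for 'Kabul S01-E01.mkv'."""
    return media.with_name(f"{media.stem}-{lang}.srt")

media = Path("Kabul S01-E01.mkv")
print(build_whisper_args(media))
print(tagged_srt_path(media))
```

    The list can be handed to `subprocess.run(args, check=True)`, after which `media.with_suffix(".srt").rename(tagged_srt_path(media))` would do the renaming step.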

    Then, store in the same directory: "Kabul S01-E01-F.srt"

    These are the forced subs for the foreign dialogs.

    *** Having '-F' is necessary.

    Whisper has translated these foreign dialogs, but we want those that are already in "Kabul S01-E01-F.srt".

    To get that, merge with this python script:

    Code:
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    merge_subtitles.py -
    - -F.srt loaded FIRST
    - Every forced sub PRESERVED exactly
    - Whisper fills gaps OR is replaced on overlap
    - No duplicates, no loss
    """
     
    import sys
    from pathlib import Path
    from typing import List, Tuple
     
    def time_to_seconds(t: str) -> float:
        h, m, s_ms = t.split(":")
        s, ms = s_ms.replace(",", ".").split(".")
        return int(h) * 3600 + int(m) * 60 + float(s) + float(ms) / 1000
     
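    def seconds_to_srt(sec: float) -> str:
        # Work in whole milliseconds so rounding can never produce "60,000" seconds
        total_ms = int(round(sec * 1000))
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"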
     
    def parse_srt_robust(content: str, filename: str) -> List[Tuple[float, float, List[str], str]]:
        entries = []
        lines = content.splitlines()
        i = 0
        while i < len(lines):
            if lines[i].strip().isdigit():
                i += 1
                if i >= len(lines): break
                time_line = lines[i].strip()
                if "-->" not in time_line:
                    i += 1
                    continue
                try:
                    start_str, end_str = time_line.split("-->", 1)
                    start = time_to_seconds(start_str.strip())
                    end = time_to_seconds(end_str.strip())
                except ValueError:
                    i += 1
                    continue
                i += 1
                text_lines = []
                while i < len(lines) and lines[i].strip() and not lines[i].strip().isdigit():
                    text_lines.append(lines[i].strip())
                    i += 1
                if text_lines:
                    entries.append((start, end, text_lines, filename))
            else:
                i += 1
        return entries
     
    def is_forced(filename: str) -> bool:
        return any(k in filename.lower() for k in ("-f.", "-forced", ".f.", "forced"))
     
    def main(mkv_path: str) -> None:
        mkv = Path(mkv_path)
        if not mkv.exists():
            print(f"[ERROR] File not found: {mkv}")
            sys.exit(1)
     
        folder = mkv.parent
        base_name = mkv.stem
        output_srt = folder / f"{base_name}.srt"
     
        if output_srt.exists():
            print("Skipping merge, file exists")
            return
     
        srt_files = list(folder.glob(f"{base_name}*.srt"))
        if not srt_files:
            print(f"[INFO] No SRT files")
            return
     
        forced_file = next((f for f in srt_files if is_forced(f.name)), None)
        whisper_file = next((f for f in srt_files if not is_forced(f.name)), None)
     
        if not forced_file:
            print("[ERROR] No -F.srt found!")
            return
     
        print(f"[INFO] Forced: {forced_file.name}")
        print(f"[INFO] Whisper: {whisper_file.name if whisper_file else 'None'}")
     
        # Parse forced
        try:
            forced_text = forced_file.read_text(encoding="utf-8", errors="replace")
            forced_subs = parse_srt_robust(forced_text, forced_file.name)
            print(f"    -> {len(forced_subs)} forced lines")
        except Exception as e:
            print(f"[ERROR] Failed to read forced: {e}")
            return
     
        # Parse whisper
        whisper_subs = []
        if whisper_file:
            try:
                whisper_text = whisper_file.read_text(encoding="utf-8", errors="replace")
                whisper_subs = parse_srt_robust(whisper_text, whisper_file.name)
                print(f"    -> {len(whisper_subs)} whisper lines")
            except Exception as e:
                print(f"[WARNING] Whisper failed: {e}")
     
        forced_subs.sort(key=lambda x: x[0])
        whisper_subs.sort(key=lambda x: x[0])
     
        final_subs = []
        w_idx = 0
        W = len(whisper_subs)
     
        for f_start, f_end, f_lines, _ in forced_subs:
            # Add all Whisper subs that END before this forced sub starts
            while w_idx < W:
                w_start, w_end, _, _ = whisper_subs[w_idx]
                if w_end <= f_start:  # No overlap
                    final_subs.append(whisper_subs[w_idx][:3])
                    w_idx += 1
                else:
                    break
     
            # Now: skip all Whisper subs that overlap this forced sub
            while w_idx < W:
                w_start, w_end, _, _ = whisper_subs[w_idx]
                if w_start < f_end:  # Overlaps
                    w_idx += 1
                else:
                    break
     
            # Add forced sub
            final_subs.append((f_start, f_end, f_lines))
     
        # Add remaining non-overlapping whisper subs
        while w_idx < W:
            final_subs.append(whisper_subs[w_idx][:3])
            w_idx += 1
     
        # Write output
        with open(output_srt, "w", encoding="utf-8") as f:
            for idx, (start, end, lines) in enumerate(final_subs, 1):
                f.write(f"{idx}\n")
                f.write(f"{seconds_to_srt(start)} --> {seconds_to_srt(end)}\n")
                for line in lines:
                    f.write(f"{line}\n")
                f.write("\n")
     
        print(f"\n[OK] Merged {len(final_subs)} blocks -> {output_srt.name}")
        print(f"    -> {len(forced_subs)} forced subs preserved (100%)")
     
    if __name__ == "__main__":
        if len(sys.argv) != 2:
            print("Usage: merge_subtitles.py <mkv_path>")
            sys.exit(1)
        main(sys.argv[1])
    That will produce "Kabul S01-E01.srt"

    It will contain:

    - English dialogs transcribed in English
    - Foreign dialogs that are already in the '-F' forced subs "as is".
    - Foreign dialogs missing in the '-F' forced subs translated by whisper.
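
    To make the rule concrete, here is a simplified restatement on in-memory data. It is not the script above (it compares every Whisper sub against every forced sub instead of walking both lists in step), and the sample lines are invented:

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)

def merge_forced(forced: List[Cue], whisper: List[Cue]) -> List[Cue]:
    """Keep every forced cue; keep a Whisper cue only if it overlaps no forced cue."""
    kept = [
        w for w in whisper
        if not any(w[0] < f[1] and f[0] < w[1] for f in forced)
    ]
    return sorted(forced + kept, key=lambda cue: cue[0])

forced = [(10.0, 12.0, "[forced] We leave at dawn.")]
whisper = [
    (5.0, 7.0, "Good evening."),            # English dialog, no overlap: kept
    (10.2, 11.8, "We go in the morning."),  # Whisper's own translation: dropped
    (13.0, 15.0, "Understood."),            # no overlap: kept
]
for start, end, text in merge_forced(forced, whisper):
    print(f"{start:6.1f} {end:6.1f}  {text}")
```

    The forced sub always wins on overlap, which is why the Whisper translation of the same foreign line disappears while everything around it stays.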

    I did all that with a mix of ChatGPT, Grok and Deepseek.

    Here are all the episodes, batch processed:

    https://limewire.com/d/9JGEX#qH5IhI1Y1x


    Keep in mind that my SE settings are different from yours, so the formatting might not be 100% to your liking.

    Also, it seems that you don't have exactly the same version, you will likely need to sync the start of the subs.
    Last edited by robena; 29th Oct 2025 at 20:47.
  6. No. The lines are too long to read. Update the code with

    --max_line_count 2 ^
    --max_line_width 36 ^
    [Attached images: Screenshot_25.png, Screenshot_26.png, Screenshot_27.png]
  7. Member
    Join Date
    May 2008
    Location
    France
    Not too many lines like that, but I'll try and compare the results, that's easy!

    The way I did it with a REXX script, it's just a right click to do everything.
  8. Updated Assembly Code [Assembly.c]
    Perfect two-line breaks. No need for an SE batch touch-up.
    Image Attached Files
  9. Member
    Join Date
    May 2008
    Location
    France
    I have not tested Assembly yet, but for Whisper I am surprised by how much difference there is between --max_line_width 36 and --max_line_width 40; it's more than just 4 characters.

    I asked why, and got:

    --max_line_width does not mean “no line may be longer than N characters”.
    It tells Faster-Whisper the target width that the line-breaker tries to stay under while it is splitting a segment into subtitle lines.
    Because the breaker also respects sentence boundaries, words that are already > N, and minimum-line-length rules, you will still see lines that are much longer than the value you passed – especially when you raise it from 36 to 40.
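
    I can't vouch for that explanation, and the snippet below is not Faster-Whisper's line breaker. It is only a stand-in built on Python's standard `textwrap` module to show two effects that any word-based breaker has: a 4-character change in the target can move the break to a different word, and a breaker that never splits words can produce a line longer than the target:

```python
import textwrap

def break_lines(text: str, target_width: int) -> list:
    """Greedy word wrap that never splits a word, so a line can exceed the target."""
    return textwrap.wrap(text, width=target_width,
                         break_long_words=False, break_on_hyphens=False)

sample = "I always believed that I had a future in this country."
for width in (36, 40):
    lines = break_lines(sample, width)
    print(width, [len(line) for line in lines], lines)

# A single word longer than the target ends up on its own over-long line
print(break_lines("Supercalifragilisticexpialidocious indeed", 10))
```

    With 36 the first line stops after "had a", with 40 it runs on to "future in", so the two results look quite different even though the setting moved by only 4.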

    Thanks for pointing it out, I would never have thought by myself that it would make the subs that much better.

    Here they are:

    https://limewire.com/d/ZpjSg#tNvqHhY3Gl

    Whisper gives much more natural looking subs than Assembly. opensubtitles.org rejects Assembly ones.

    Edit: I'll reserve judgment until I test your version!

    Edit edit: I get "Failed to upload file." with your version. Firewall was open. Don't waste time on it for me, I'll use whisper from now on.

    Edit edit edit: stupid, I forgot to update the API key!!!
    Last edited by robena; 30th Oct 2025 at 03:50.
  10. Member
    Join Date
    May 2008
    Location
    France
    Sam,

    I remux my subs using a routine that detects the aspect ratio and creates PGS files that are located inside the active video region, just a few pixels above the black bar:

    Image
    [Attachment 89467 - Click to enlarge]
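
    The arithmetic behind that kind of placement can be sketched as follows. This is not my actual routine: it assumes a centered letterbox, takes the content aspect ratio as an input rather than detecting it, and the 8-pixel default margin is an arbitrary illustration value:

```python
def subtitle_bottom_y(frame_w: int, frame_h: int, content_ar: float, margin_px: int = 8) -> int:
    """Y coordinate (from the top) for the bottom edge of a subtitle image.

    Places it `margin_px` above the lower black bar of a letterboxed frame.
    """
    active_h = min(frame_h, round(frame_w / content_ar))  # height of the picture area
    bar_h = (frame_h - active_h) // 2                      # height of each black bar
    return frame_h - bar_h - margin_px

# 2.40:1 picture inside a 1920x1080 frame: 140 px bars top and bottom
print(subtitle_bottom_y(1920, 1080, 2.40))
# Full-frame 16:9 picture: no bars, so only the margin applies
print(subtitle_bottom_y(1920, 1080, 16 / 9))
```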


    Interested?
  11. Of course YES.


    The way I did it with a REXX script, it's just a right click to do everything.
    Kindly PM your code for Faster-Whisper-XXL that you used
  12. Tehran [Season 03] KAN (Color-Yellow) ENG-NON Hi

    subs uploaded > opensubtitles.org [uploader-SamGer]
    Last edited by sam12345; 30th Oct 2025 at 06:48.
  13. Member
    Join Date
    May 2008
    Location
    France
    Originally Posted by sam12345 View Post
    Tehran [Season 03] KAN (Color-Yellow) ENG-NON Hi

    subs uploaded > opensubtitles.org [uploader-SamGer]
    Waiting for the UHD version...


