VideoHelp Forum
  1. I've been using Subtitle Edit (SE) for years to import PGS subtitles via its OCR import, but I recently stumbled on a strange problem. In some rare but reproducible cases, SE "corrects" mostly good Tesseract output into gibberish, as shown below:

    [Attachment 48487]

    I've sent Nikolaj the subtitle file but it works fine for him. Can someone try to import the attached mks excerpt to see if you get the same behavior I'm seeing? Thanks in advance!

    Interestingly, if I select "None" for the dictionary, then the line isn't munged. This isn't desirable for me though since I rely on the unknown words list to verify OCR accuracy against the images. I'm running 3.5.9 on Windows 10 and have tried clearing out all my settings to no avail.
  2. OCR with Tesseract v3.02
    1
    -00:00:00,001 --> 00:00:01,208
    Or engaged.

    2
    00:00:03,085 --> 00:00:05,129
    Qflta stupid

    3
    00:00:05,212 --> 00:00:10,212
    My brother's marrying Leta,
    June the 6th.
    OCR with Tesseract v4.00
    1
    -00:00:00,001 --> 00:00:01,208
    Or engaged.

    2
    00:00:03,085 --> 00:00:05,129
    - What?
    - It was a mistake in a stupid magazine.

    3
    00:00:05,212 --> 00:00:10,212
    My brother's marrying Leta,
    June the 6th.
  3. Thanks for looking at this, Metti. It's helpful to know I'm not the only one experiencing this weirdness. I hadn't played with Tesseract 4 yet, but I just downloaded it and got the same results as you did. I did notice that it seems to be much slower than Tesseract 3, though. I haven't been able to find much guidance on the differences between 3 and 4 as they pertain to Subtitle Edit. Can anyone give reasons for or against upgrading to 4?

    Tesseract versions aside, it seems Subtitle Edit is doing something incorrect when it changes what Tesseract read correctly:

    Code:
    -What? -lt was a mistake in a stupid magazine. -> Qflta stupid
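    For anyone who wants to catch this kind of over-correction automatically, here is a minimal Python sketch. It is not how Subtitle Edit works internally; it only compares a raw OCR line with its "corrected" version and flags the pair when they are too dissimilar. The function name and the 0.6 threshold are my own assumptions, chosen for illustration.

```python
from difflib import SequenceMatcher


def correction_is_suspicious(raw: str, corrected: str, min_ratio: float = 0.6) -> bool:
    """Return True when the corrected text differs too much from the raw OCR text.

    min_ratio is an arbitrary illustrative threshold, not a Subtitle Edit setting.
    """
    ratio = SequenceMatcher(None, raw, corrected).ratio()
    return ratio < min_ratio


# The line from this thread: raw Tesseract output vs. what ended up in the subtitle.
raw = "-What? -lt was a mistake in a stupid magazine."
corrected = "Qflta stupid"
print(correction_is_suspicious(raw, corrected))

# A small, legitimate fix (l -> I) should not be flagged.
print(correction_is_suspicious("-lt was a mistake.", "-It was a mistake."))
```

    A character-level similarity ratio is crude, but a legitimate dictionary fix changes only a few characters, while the case above throws away most of the line.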
  4. I've been doing a lot of testing with Tesseract 4.0, and I've found that overall it's much more accurate than 3.02, although it is a lot slower and more memory intensive. However, I've found a number of instances where it completely fails to see some words and phrases. See the attached files for some examples.

    In the case of "single letters.mkv", the first four lines are completely missed (they should spell "LATE").

    [Attachment 48682]

    Second example, "missing words.mkv":

    [Attachment 48683]

    Code:
    1
    - Hey.
    
    2
    - Don't know!
    - Whatever else you do.
    
    3
    - Told you he wouldn't be long.
    - Hi! wanted to surprise you!
    
    4
    - Mean, tears are rolling down my cheeks...
    - Of course you start crying.
    
    5
    - Hello
    Subs 1 and 5 are missing a 2nd line that reads "-Hi.", and the others are missing the word "I" at various places. The "Original Tesseract only" engine mode sometimes will catch more of the missing I's, but all of the engine modes miss the Hi.
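    One way to narrow down whether the missing text comes from Tesseract itself or from Subtitle Edit's post-processing would be to run the Tesseract 4 command-line tool directly on an exported subtitle image, once per engine mode. Below is a minimal Python sketch of that idea. It assumes the `tesseract` executable is on the PATH and that the installed traineddata contains the legacy model (needed for `--oem 0`). The helper names and the image filename are hypothetical, and this is not necessarily how Subtitle Edit invokes Tesseract.

```python
import subprocess

# Engine modes accepted by the Tesseract 4 command line via --oem.
OEM_MODES = {
    0: "legacy engine only",
    1: "LSTM engine only",
    2: "legacy + LSTM",
    3: "default (whatever is available)",
}


def build_tesseract_command(image_path: str, oem: int, psm: int = 6, lang: str = "eng") -> list[str]:
    """Build the argument list for one OCR run, printing the text to stdout."""
    if oem not in OEM_MODES:
        raise ValueError(f"unknown engine mode: {oem}")
    return ["tesseract", image_path, "stdout",
            "--oem", str(oem), "--psm", str(psm), "-l", lang]


def ocr_image(image_path: str, oem: int) -> str:
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(build_tesseract_command(image_path, oem),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Show the commands that would be run for a hypothetical exported image.
for mode, label in OEM_MODES.items():
    print(label, "->", " ".join(build_tesseract_command("sub_0001.png", mode)))
```

    Calling `ocr_image("sub_0001.png", mode)` for each mode and comparing the results with what Subtitle Edit shows would indicate where the text gets lost.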

    Using 3.02, it's almost perfect except sub 5 is also missing the Hi:

    Code:
    1
    - Hey.
    - Hi.
    
    2
    - I don't know!
    - Whatever else you do.
    
    3
    - Told you he wouldn't be long.
    - Hi! I wanted to surprise you!
    
    4
    - I mean, tears are rolling down my cheeks...
    - Of course you start crying.
    
    5
    - Hello.
    Can someone confirm these results? I've tried with 3.5.9 as well as the latest beta. Aside from these issues, I much prefer the accuracy of Tesseract 4.0 over 3.02. Are there other options I can tweak that may help? Thanks for any input!
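    To make this kind of comparison easier to repeat on longer files, here is a rough Python sketch that lists the words present in one OCR result but missing from another, cue by cue. It assumes both results are saved as SRT-style text (cue number, optional timestamp line, then the text lines) and that the cue numbers line up. The function names are hypothetical, and the sample data is just the first two subs from above.

```python
import re
from difflib import SequenceMatcher


def parse_cues(text: str) -> dict[int, list[str]]:
    """Map cue number -> text lines, skipping any '-->' timestamp lines."""
    cues = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines or not lines[0].isdigit():
            continue
        cues[int(lines[0])] = [ln for ln in lines[1:] if "-->" not in ln]
    return cues


def missing_words(reference: dict[int, list[str]], candidate: dict[int, list[str]]) -> dict[int, list[str]]:
    """For each cue, list reference words the candidate lacks (case-insensitive)."""
    report = {}
    for num, ref_lines in reference.items():
        ref_words = " ".join(ref_lines).split()
        cand_words = " ".join(candidate.get(num, [])).split()
        matcher = SequenceMatcher(None,
                                  [w.lower() for w in cand_words],
                                  [w.lower() for w in ref_words])
        missing = []
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
            if tag in ("insert", "replace"):
                missing.extend(ref_words[j1:j2])
        if missing:
            report[num] = missing
    return report


V3_TEXT = """1
- Hey.
- Hi.

2
- I don't know!
- Whatever else you do.
"""

V4_TEXT = """1
- Hey.

2
- Don't know!
- Whatever else you do.
"""

report = missing_words(parse_cues(V3_TEXT), parse_cues(V4_TEXT))
print(report)
```

    Running it in both directions (3.02 as reference, then 4.0 as reference) shows what each engine catches that the other misses. Note that differences in punctuation, such as "Hello" versus "Hello.", would also be reported as a missing word.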