I've been using Subtitle Edit for years to convert PGS subs with the OCR import, but I recently stumbled on a strange problem. In some rare but reproducible instances, SE "corrects" mostly good output from Tesseract by replacing it with gibberish, as shown below:
[Attachment 48487]
I've sent Nikolaj the subtitle file but it works fine for him. Can someone try to import the attached mks excerpt to see if you get the same behavior I'm seeing? Thanks in advance!
Interestingly, if I select "None" for the dictionary, the line isn't munged. That isn't a workable fix for me, though, since I rely on the unknown words list to verify OCR accuracy against the images. I'm running 3.5.9 on Windows 10 and have tried clearing out all my settings, to no avail.
-
OCR with Tesseract v3.02
1
-00:00:00,001 --> 00:00:01,208
Or engaged.
2
00:00:03,085 --> 00:00:05,129
Qflta stupid
3
00:00:05,212 --> 00:00:10,212
My brother's marrying Leta,
June the 6th.
1
-00:00:00,001 --> 00:00:01,208
Or engaged.
2
00:00:03,085 --> 00:00:05,129
- What?
- It was a mistake in a stupid magazine.
3
00:00:05,212 --> 00:00:10,212
My brother's marrying Leta,
June the 6th.
-
Thanks for looking at this, Metti. It's helpful to know I'm not the only one experiencing this weirdness. I hadn't played with Tesseract 4 before, but I just downloaded it and got the same results as you did. I did notice that it seems to be much slower than Tesseract 3, though. I haven't been able to find much guidance on how 3 and 4 differ as far as Subtitle Edit is concerned. Can anyone recommend reasons for or against upgrading to 4?
Tesseract versions aside, it seems Subtitle Edit is doing something incorrect when it changes what Tesseract read correctly:
Code:
-What?
-lt was a mistake in a stupid magazine.
->
Qflta stupid
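For anyone who wants to hunt for other lines like this in a long file, here is a rough Python sketch that flags "corrections" that look nothing like the raw OCR text. It is not anything Subtitle Edit does internally; the function names and the 0.5 threshold are my own assumptions, and it presumes you have both the raw and the corrected text for each line available as strings.

```python
from difflib import SequenceMatcher


def similarity(raw: str, corrected: str) -> float:
    """Return a 0..1 similarity ratio between the raw and corrected text."""
    return SequenceMatcher(None, raw, corrected).ratio()


def is_suspicious(raw: str, corrected: str, threshold: float = 0.5) -> bool:
    """Flag a correction that changed the raw OCR text too drastically.

    The threshold is an arbitrary starting point (an assumption), not a
    value taken from Subtitle Edit or Tesseract.
    """
    return similarity(raw, corrected) < threshold


# The line from this thread: raw Tesseract text vs. what ended up in SE.
raw_text = "-What?\n-lt was a mistake in a stupid magazine."
munged = "Qflta stupid"
print(is_suspicious(raw_text, munged))
```

A normal fix such as "lt" becoming "It" keeps the ratio high, so only lines that were rewritten wholesale should get flagged.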
-
I've been doing a lot of testing with Tesseract 4.0, and I've found that overall it's much more accurate than 3.02, although it is a lot slower and more memory-intensive. However, I've found a number of instances where it completely fails to see some words and phrases. See the attached files for some examples.
In the case of "single letters.mkv", the first four lines are completely missed (they should spell "LATE").
[Attachment 48682]
Second example, "missing words.mkv":
[Attachment 48683]
Code:
1
- Hey.

2
- Don't know!
- Whatever else you do.

3
- Told you he wouldn't be long.
- Hi! wanted to surprise you!

4
- Mean, tears are rolling down my cheeks...
- Of course you start crying.

5
- Hello
Using 3.02, it's almost perfect except sub 5 is also missing the Hi:
Code:
1
- Hey.
- Hi.

2
- I don't know!
- Whatever else you do.

3
- Told you he wouldn't be long.
- Hi! I wanted to surprise you!

4
- I mean, tears are rolling down my cheeks...
- Of course you start crying.

5
- Hello.
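Spotting the dropped words by eye gets tedious, so here is a small Python sketch that lists the words present in one OCR result but absent from the other. It is only an illustration: the `missing_words` helper is hypothetical, it splits on whitespace and ignores case (both my assumptions), and it only reports words that were dropped outright, not words that were misread as something else.

```python
from difflib import SequenceMatcher
from typing import List


def missing_words(reference: str, candidate: str) -> List[str]:
    """Return words from `reference` that were dropped in `candidate`.

    Words are compared case-insensitively, so "don't" and "Don't" count
    as the same word. Substituted words are deliberately not reported.
    """
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = SequenceMatcher(
        None,
        [w.lower() for w in ref_words],
        [w.lower() for w in cand_words],
        autojunk=False,
    )
    dropped: List[str] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            # Present in the reference, nothing at all in the candidate.
            dropped.extend(ref_words[i1:i2])
    return dropped


# Sub 2 from the example above: 3.02 output as reference, 4.0 as candidate.
print(missing_words("- I don't know!", "- Don't know!"))
```

Feeding it one subtitle at a time, rather than the whole file, keeps the alignment simple when the same short words repeat a lot.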