I've been using Subtitle Edit for years to import PGS subs via OCR, but I recently stumbled on a strange problem. In some rare but reproducible instances, SE "corrects" mostly good Tesseract output into gibberish, as shown below:
[Attachment 48487]
I've sent Nikolaj the subtitle file but it works fine for him. Can someone try to import the attached mks excerpt to see if you get the same behavior I'm seeing? Thanks in advance!
Interestingly, if I select "None" for the dictionary, then the line isn't munged. This isn't desirable for me though since I rely on the unknown words list to verify OCR accuracy against the images. I'm running 3.5.9 on Windows 10 and have tried clearing out all my settings to no avail.
		
- 
	OCR with Tesseract v3.02:

 1
 -00:00:00,001 --> 00:00:01,208
 Or engaged.
 
 2
 00:00:03,085 --> 00:00:05,129
 Qflta stupid
 
 3
 00:00:05,212 --> 00:00:10,212
 My brother's marrying Leta,
 June the 6th.

 OCR with Tesseract v4.001:

 1
 -00:00:00,001 --> 00:00:01,208
 Or engaged.
 
 2
 00:00:03,085 --> 00:00:05,129
 - What?
 - It was a mistake in a stupid magazine.
 
 3
 00:00:05,212 --> 00:00:10,212
 My brother's marrying Leta,
 June the 6th.
- 
	Thanks for looking at this, Metti. It's helpful to know I'm not the only one experiencing this weirdness. I haven't played with Tesseract 4 yet, but I just downloaded it and got the same results as you did. I did notice it seems to be much slower than Tesseract 3, though. I haven't been able to find much guidance on the differences between 3 and 4 as they pertain to Subtitle Edit. Can anyone recommend reasons for or against upgrading to 4?
 
 Tesseract versions aside, it seems Subtitle Edit is doing something incorrect when it changes what Tesseract read correctly:
 
 Code:
 -What? -lt was a mistake in a stupid magazine. -> Qflta stupid
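
 One way to catch this kind of over-correction outside Subtitle Edit is to compare the raw Tesseract text with the corrected text and flag lines where the two barely resemble each other. A minimal sketch in Python, assuming both versions can be exported as plain strings; the function name and the 0.6 threshold are illustrative choices, not anything Subtitle Edit is known to do internally:

```python
from difflib import SequenceMatcher


def is_suspicious_correction(raw: str, corrected: str, threshold: float = 0.6) -> bool:
    """Return True when the corrected line looks too different from the raw OCR line.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    similarity = SequenceMatcher(None, raw, corrected).ratio()
    return similarity < threshold


raw_line = "-What? -lt was a mistake in a stupid magazine."
bad_fix = "Qflta stupid"
good_fix = "-What? -It was a mistake in a stupid magazine."

print(is_suspicious_correction(raw_line, bad_fix))   # flagged: most of the line was lost
print(is_suspicious_correction(raw_line, good_fix))  # not flagged: only one character changed
```

 A real "l" to "I" fix changes a single character and passes the check, while a rewrite that throws away most of the line gets flagged for manual review against the image.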
- 
	I've been doing a lot of testing with Tesseract 4.0, and I've found that overall it's much more accurate than 3.02, although it is a lot slower and more memory-intensive. However, I've found a number of instances where it completely fails to see some words and phrases. See attached files for some examples.
 
 In the case of "single letters.mkv", the first four lines are completely missed (they should spell "LATE").
 
 
 [Attachment 48682]
 
 Second example, "missing words.mkv":
 
 
 [Attachment 48683]
 
 Subs 1 and 5 are missing a 2nd line that reads "-Hi.", and the others are missing the word "I" at various places. The "Original Tesseract only" engine mode sometimes will catch more of the missing I's, but all of the engine modes miss the Hi.

 Code:
 1
 - Hey.

 2
 - Don't know!
 - Whatever else you do.

 3
 - Told you he wouldn't be long.
 - Hi! wanted to surprise you!

 4
 - Mean, tears are rolling down my cheeks...
 - Of course you start crying.

 5
 - Hello
 
 Using 3.02, it's almost perfect except sub 5 is also missing the Hi:

 Code:
 1
 - Hey.
 - Hi.

 2
 - I don't know!
 - Whatever else you do.

 3
 - Told you he wouldn't be long.
 - Hi! I wanted to surprise you!

 4
 - I mean, tears are rolling down my cheeks...
 - Of course you start crying.

 5
 - Hello.

 Can someone confirm these results? I've tried with 3.5.9 as well as the latest beta. Aside from these issues, I much prefer the accuracy of Tesseract 4.0 over 3.02. Are there other options I can tweak that may help? Thanks for any input!
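
 For anyone who wants to reproduce the 3.02 vs 4.0 comparison without eyeballing every line, the dropped words can be listed automatically by diffing the two outputs word by word. A rough sketch in Python, assuming each engine's text for a subtitle line is available as a string; `words_missing_from` is a hypothetical helper name, and it only reports words that were dropped outright, not words that were misread:

```python
from difflib import SequenceMatcher


def words_missing_from(candidate: str, reference: str) -> list[str]:
    """List words that appear in `reference` but were dropped from `candidate`.

    Comparison is case-insensitive so that "mean," and "Mean," still line up.
    """
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = SequenceMatcher(
        None,
        [w.lower() for w in ref_words],
        [w.lower() for w in cand_words],
        autojunk=False,
    )
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":  # present in the reference only
            missing.extend(ref_words[i1:i2])
    return missing


v302_line = "- I mean, tears are rolling down my cheeks..."
v4_line = "- Mean, tears are rolling down my cheeks..."
print(words_missing_from(v4_line, v302_line))
```

 Running this over each pair of lines gives a quick list of candidates (such as the missing "I") to check against the subtitle images.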