EDIT: Solved thanks to Nikse's great program, SubtitleEdit.
Hi, I am new here (I am a noob, but I've been trying options and reading forums for days, and I can't find a solution).
I am trying to learn how to convert the graphic subtitles of a video recorded from DVB to text-based subtitles.
The subtitles look like this on the video (much better actually, not blurry at all, VLC makes weird captures), with different colors to differentiate the speaker:
I'd like to convert these subtitles into SRT in order to view the recorded program on devices other than the computer. A quick process would be perfect, since I will only watch the program and then erase it.
Before I start explaining what I've tried so far, here is a sample (just 27 MB) in case anyone wants to try it: http://www.megaupload.com/?d=W82MAKEX (the original recorded file is called "(grabación original) subs digitales.mpeg")
What I've tried so far:
-The recorded video is an MPEG TS file that includes all streams (video, two audio tracks, Spanish and English, and graphic Spanish subtitles).
-I've demuxed all streams with ProjectX, trying several subtitle output options: SON+BMP, SUP and IDX+SUB.
-I've opened the IDX+SUB with SubRip and SubResync, but it is impossible to get a proper OCR:
-Same for the SUP file in DVDSubEdit 1.52: very poor quality for OCR:
-The SON+BMP output is the best: the BMP images are just perfect for OCR and the SON file has the timing for each subtitle, BUT I haven't found any "SON+BMP to SRT" program...
Here is an extract of the SON file:
Code:
SP_NUMBER START END FILE_NAME
Color (0 1 2 3)
Contrast (0 2 7 11)
Display_Area (000 474 720 562)
0000 00:00:02:16 00:00:05:07 subs digitales_st00000p4.bmp
Color (0 8 2 1)
Contrast (0 4 7 2)
Display_Area (000 426 720 558)
0001 00:00:07:01 00:00:09:21 subs digitales_st00001p4.bmp
Color (0 0 1 2)
Contrast (0 0 2 7)
Display_Area (000 426 720 514)
Here are a couple of BMP examples, crystal clear:
I've tried an online free OCR on that image, and got this:
Perfect recognition!
So, I am looking either for a "SON+BMP to SRT" program or a way to extract proper and decent IDX+SUB or SUP files from the MPEG TS (My guess is that the colors of the subtitles are the problem).
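For reference, the timing half of a "SON+BMP to SRT" conversion can be sketched in a few lines of Python. This is only an illustration, not how any of the tools in this thread do it. It assumes the last field of each SON timecode is a frame count at 25 fps (PAL DVB), which the thread does not confirm; `FPS`, `son_to_srt_time` and `parse_son` are made-up names.

```python
import re

FPS = 25  # assumption: PAL recording, so the fourth timecode field counts frames at 25 fps

# Matches entry lines such as:
# 0000 00:00:02:16 00:00:05:07 subs digitales_st00000p4.bmp
ENTRY = re.compile(
    r"^(\d{4})\s+(\d\d:\d\d:\d\d:\d\d)\s+(\d\d:\d\d:\d\d:\d\d)\s+(.+\.bmp)$"
)

def son_to_srt_time(timecode, fps=FPS):
    """Convert HH:MM:SS:FF (frames) into the HH:MM:SS,mmm form used by SRT."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + round(frames * 1000 / fps)
    total_s, millis = divmod(total_ms, 1000)
    total_m, secs = divmod(total_s, 60)
    hrs, mins = divmod(total_m, 60)
    return f"{hrs:02d}:{mins:02d}:{secs:02d},{millis:03d}"

def parse_son(text):
    """Return (number, start, end, bmp_name) tuples; header, Color, Contrast and
    Display_Area lines do not match the entry pattern and are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            number, start, end, bmp = match.groups()
            entries.append((int(number), son_to_srt_time(start), son_to_srt_time(end), bmp))
    return entries

sample = """SP_NUMBER START END FILE_NAME
Color (0 1 2 3)
Contrast (0 2 7 11)
Display_Area (000 474 720 562)
0000 00:00:02:16 00:00:05:07 subs digitales_st00000p4.bmp
Color (0 8 2 1)
Contrast (0 4 7 2)
Display_Area (000 426 720 558)
0001 00:00:07:01 00:00:09:21 subs digitales_st00001p4.bmp
"""

for entry in parse_son(sample):
    print(entry)
```

The text for each cue would still have to come from an OCR pass over the matching BMP file.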
Any help is appreciated.
Thanks in advance.
-
Last edited by edea; 24th Oct 2011 at 16:18.
-
Hi edea!
You could try subtitle edit: http://subtitleedit.googlecode.com/files/SubtitleEdit32Setup.zip
SE should be able to import+ocr both sub/idx and son/bmp... I would like to add support for importing subtitles directly from ts files, but that will be a later version. -
Another avenue you can try is to open the idx/sub in BDSup2Sub and export as ifo/sup. Load the ifo/sup in DVDSubEdit.
Do automatic OCR and export as .srt. I've had good luck with English subs. I'm not sure if the upside down question marks will throw it off.
Usually lines that don't auto OCR well with this method show up with one or more underscores. Sort of like this:
ap__%%*&__7?_x
If you search for underscores in the output and don't find any, chances are it came out clean.
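The underscore check can also be scripted. The following is a minimal Python sketch, assuming the exported .srt uses blank lines between cues; `suspect_cues` is a made-up name and the sample cues and timings are invented for illustration.

```python
import re

def suspect_cues(srt_text):
    """Return (cue_number, text) pairs for cues whose text contains an underscore."""
    suspects = []
    # Cues in an SRT file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete cue (number, timing line, text)
        text = "\n".join(lines[2:])
        if "_" in text:
            suspects.append((lines[0].strip(), text))
    return suspects

# Invented sample: the first cue reuses the garbage line quoted above.
sample = """1
00:00:01,000 --> 00:00:03,000
ap__%%*&__7?_x

2
00:00:04,000 --> 00:00:06,000
This line came out clean.
"""

for number, text in suspect_cues(sample):
    print(f"cue {number}: {text}")
```

An empty result suggests a clean OCR pass, though a line can of course be wrong without containing an underscore.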
As I say though, my only experience with this technique is using English subs.
http://milesaheadsoftware.org/
Fully enabled freeware for Windows PCs. -
Thanks! Your program looks great, but I can't open the SON subtitle, I get this error:
Code:
See the end of this message for details on invoking just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.ArgumentException: Parameter is not valid.
   at System.Drawing.Bitmap.LockBits(Rectangle rect, ImageLockMode flags, PixelFormat format, BitmapData bitmapData)
   at System.Drawing.Bitmap.LockBits(Rectangle rect, ImageLockMode flags, PixelFormat format)
   at Nikse.SubtitleEdit.Logic.FastBitmap.LockImage()
   at Nikse.SubtitleEdit.Forms.VobSubOcr.GetSubtitleBitmap(Int32 index)
   at Nikse.SubtitleEdit.Forms.VobSubOcr.ShowSubtitleImage(Int32 index)
   at Nikse.SubtitleEdit.Forms.VobSubOcr.SubtitleListView1SelectedIndexChanged(Object sender, EventArgs e)
   at System.Windows.Forms.ListView.OnSelectedIndexChanged(EventArgs e)
   at System.Windows.Forms.ListView.WmReflectNotify(Message& m)
   at System.Windows.Forms.ListView.WndProc(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
   at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

************** Loaded Assemblies **************
mscorlib
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
SubtitleEdit
    Assembly Version: 3.2.0.33640
    Win32 Version: 3.2.0.33640
    CodeBase: file:///C:/Archivos%20de%20programa/Subtitle%20Edit/SubtitleEdit.exe
----------------------------------------
System
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Windows.Forms
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System.Drawing
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System.Xml
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
System.Windows.Forms.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms.resources/2.0.0.0_es_b77a5c561934e089/System.Windows.Forms.resources.dll
----------------------------------------
System.XML.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml.resources/2.0.0.0_es_b77a5c561934e089/System.Xml.resources.dll
----------------------------------------
mscorlib.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
NHunspell
    Assembly Version: 0.9.6.0
    Win32 Version: 0.9.6.0
    CodeBase: file:///C:/Archivos%20de%20programa/Subtitle%20Edit/NHunspell.DLL
----------------------------------------
System.Drawing.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing.resources/2.0.0.0_es_b03f5f7f11d50a3a/System.Drawing.resources.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the configuration file for this application or computer (machine.config) must have the jitDebugging value set in the system.windows.forms section. The application must also be compiled with debugging enabled.

For example:

<configuration>
    <system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception will be sent to the JIT debugger registered on the computer rather than be handled by this dialog box.
And if I open the SUB/IDX (the clearest that I've been able to extract, with ProjectX, selecting "UKFreeView", but not as clear as the BMP files), I get this after executing the OCR:
Nothing is recognized! I must say I've never had good results with Tesseract. In Ubuntu, the only tool that gives me almost perfect results is GOCR:
yo@desktop:~$ gocr subsdigitales_st00001p4.bmp
bmptoppm: Windows BMP, 720x132x8
bmptoppm: WRITING PPM IMAGE
Bueno, bueno, deja que te mire.
Es que has...
yo@desktop:~$ gocr subsdigitales_st00002p4.bmp
bmptoppm: Windows BMP, 720x88x8
bmptoppm: WRITING PPM IMAGE
Pero _quė te ocurre?
-Que me siento feliz.
yo@desktop:~$ gocr subsdigitales_st00013p4.bmp
bmptoppm: Windows BMP, 720x132x8
bmptoppm: WRITING PPM IMAGE
...para molestar a tu abuelita.
- _Sharon !
yo@desktop:~$
It only fails with accented vowels and the inverted question and exclamation marks (á é í ó ú, Á É Í Ó Ú, ¿ ¡). It even recognizes "ñ".
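Running GOCR over every BMP and pairing the text with timings could be scripted roughly as below. This is a hedged sketch: it calls `gocr <file>` the same way as the terminal session above, the function names are made up, and the timestamps in the demo assume the SON entry 0001 times (00:00:07:01 to 00:00:09:21) are frames at 25 fps. The `bmptoppm:` lines are filtered out in case they end up in the captured output.

```python
import subprocess

def strip_converter_noise(output):
    """Drop the 'bmptoppm:' progress lines seen in the session above, if present."""
    lines = [line for line in output.splitlines() if not line.startswith("bmptoppm:")]
    return "\n".join(lines).strip()

def gocr_text(bmp_path):
    """Run `gocr <file>` as in the session above and return the recognized text."""
    result = subprocess.run(["gocr", bmp_path], capture_output=True, text=True, check=True)
    return strip_converter_noise(result.stdout)

def build_srt(cues, ocr=gocr_text):
    """Build SRT text from (start, end, bmp_path) tuples with SRT-style timestamps.

    The OCR function is a parameter so another engine can be swapped in."""
    blocks = []
    for index, (start, end, bmp_path) in enumerate(cues, start=1):
        blocks.append(f"{index}\n{start} --> {end}\n{ocr(bmp_path)}\n")
    return "\n".join(blocks)

if __name__ == "__main__":
    # Demo with canned text from the session above, so it runs without gocr installed.
    canned = {"subsdigitales_st00001p4.bmp": "Bueno, bueno, deja que te mire.\nEs que has..."}
    cues = [("00:00:07,040", "00:00:09,840", "subsdigitales_st00001p4.bmp")]
    print(build_srt(cues, ocr=canned.get))
```

Calling `build_srt(cues)` without the `ocr` argument would invoke the real `gocr` binary for each file.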
Thanks, I'll try it. -
Hi edea!
Thx for testing
I could not open the SON file with SE 3.2 at all...
In order to use Tesseract, you need to use the "Spanish" Tesseract dictionary for Spanish subtitles (and the English one for English subs).
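Outside of SE, the same idea applies to the standalone Tesseract command line, where the language is chosen with `-l`. The sketch below is an illustration only and says nothing about how SE calls Tesseract internally. It assumes a `tesseract` binary with Spanish language data (`spa`) is installed and that the installed version can read the image format used; the function names are made up.

```python
import subprocess

def tesseract_command(image_path, output_base, language="spa"):
    """Build the command line; Tesseract writes its result to output_base + '.txt'."""
    return ["tesseract", image_path, output_base, "-l", language]

def run_tesseract(image_path, output_base, language="spa"):
    """Run Tesseract and return the recognized text (requires tesseract on PATH)."""
    subprocess.run(tesseract_command(image_path, output_base, language), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()

# Show the command that would be run for one of the BMP files from this thread.
print(" ".join(tesseract_command("subsdigitales_st00001p4.bmp", "st00001", "spa")))
```

Passing `language="eng"` instead would select the English data for English subs.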
I've fixed the SON file reading + included Spanish dictionaries in this version: http://www.nikse.dk/SubtitleEdit.zip
You could also use "Image compare" as ocr method (a bit like subrip) -
Thanks indeed for trying to fix it and adding the Spanish dictionaries. In this new version, I get the same error when I open the SON file:
Code:
See the end of this message for details on invoking just-in-time (JIT) debugging instead of this dialog box.

************** Exception Text **************
System.ArgumentException: Parameter is not valid.
   at System.Drawing.Bitmap.LockBits(Rectangle rect, ImageLockMode flags, PixelFormat format, BitmapData bitmapData)
   at System.Drawing.Bitmap.LockBits(Rectangle rect, ImageLockMode flags, PixelFormat format)
   at Nikse.SubtitleEdit.Logic.FastBitmap.LockImage()
   at Nikse.SubtitleEdit.Forms.VobSubOcr.GetSubtitleBitmap(Int32 index)
   at Nikse.SubtitleEdit.Forms.VobSubOcr.ShowSubtitleImage(Int32 index)
   at Nikse.SubtitleEdit.Forms.VobSubOcr.SubtitleListView1SelectedIndexChanged(Object sender, EventArgs e)
   at System.Windows.Forms.ListView.OnSelectedIndexChanged(EventArgs e)
   at System.Windows.Forms.ListView.WmReflectNotify(Message& m)
   at System.Windows.Forms.ListView.WndProc(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
   at System.Windows.Forms.NativeWindow.Callback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)

************** Loaded Assemblies **************
mscorlib
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
SubtitleEdit
    Assembly Version: 3.2.0.24454
    Win32 Version: 3.2.0.24454
    CodeBase: file:///D:/Documentos%20de%20Inma/Downloads/SubtitleEdit/SubtitleEdit.exe
----------------------------------------
System.Windows.Forms
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms/2.0.0.0__b77a5c561934e089/System.Windows.Forms.dll
----------------------------------------
System
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System/2.0.0.0__b77a5c561934e089/System.dll
----------------------------------------
System.Drawing
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing/2.0.0.0__b03f5f7f11d50a3a/System.Drawing.dll
----------------------------------------
System.Xml
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml/2.0.0.0__b77a5c561934e089/System.Xml.dll
----------------------------------------
System.Windows.Forms.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Windows.Forms.resources/2.0.0.0_es_b77a5c561934e089/System.Windows.Forms.resources.dll
----------------------------------------
System.XML.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Xml.resources/2.0.0.0_es_b77a5c561934e089/System.Xml.resources.dll
----------------------------------------
mscorlib.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/Microsoft.NET/Framework/v2.0.50727/mscorlib.dll
----------------------------------------
System.Drawing.resources
    Assembly Version: 2.0.0.0
    Win32 Version: 2.0.50727.3053 (netfxsp.050727-3000)
    CodeBase: file:///C:/WINDOWS/assembly/GAC_MSIL/System.Drawing.resources/2.0.0.0_es_b03f5f7f11d50a3a/System.Drawing.resources.dll
----------------------------------------

************** JIT Debugging **************
To enable just-in-time (JIT) debugging, the configuration file for this application or computer (machine.config) must have the jitDebugging value set in the system.windows.forms section. The application must also be compiled with debugging enabled.

For example:

<configuration>
    <system.windows.forms jitDebugging="true" />
</configuration>

When JIT debugging is enabled, any unhandled exception will be sent to the JIT debugger registered on the computer rather than be handled by this dialog box.
If I select "Spanish" and then start the OCR, I get another error: MSVCR.dll is not found, and the OCR can't continue. -
I cannot recreate this error with the uploaded son/bmp... any chance you could upload the full son/bmp set?
The Tesseract version included was not correct, this should be fixed: http://www.nikse.dk/SubtitleEdit.zip -
Hi edea!
The uploaded file is the same as the first sample file (with spaces in file name)!
The one SE crashes on must be another without spaces in file name, right (or wrong)?
The uploaded SON works very well for me:
>Any chance you could add GOCR as another OCR method?
Should be possible as it has a command line interface. Hm, I could not make GOCR open any files - tried with png, bmp, and tif... -
Hi Nikse!
I sent you an email with a video showing the error.
I've just tried Subtitle Edit on a different computer, and it works! No error, and it works like a charm! (but I can't use my TV card here, since there is no antenna connection here)
What could be the problem with the other computer? Could it be the .NET Framework version? I had to install .NET Framework 4 on the other computer because another subtitle program required it in order to work. -
Hi edea!
Thx for the video/info!
It looks like it's due to some limitation on WinXP and bitmaps, which should be fixed here I hope: http://www.nikse.dk/SubtitleEdit.zip
(it will be a bit slower on WinXP - but should not crash) -
Thanks! I am going to test it now.
But how did the other version work on the other computer? Both have Windows XP SP3 -
-
Could be...
The only version that I tried on the laptop (the "other" computer) was build 19559, was the XP issue fixed in that version?
Now I am trying to use SE in Ubuntu, but I still don't know how to open it (mono is installed, I think)
Anyway, thank you very much for everything! -
Could be... just go with latest version - 3.2.2
Have you tried with "mono SubtitleEdit.exe" ?
(check the readme file) -
Yes, that is what I tried (I read the readme), but it gave an error. I'll copy it later (right now I am using Windows, trying to cut the MPEG TS file).
Edit: this is what I get:
Code:
yo@desktop:~/Descargas/SE32Linux$ mono SubtitleEdit.exe

** (SubtitleEdit.exe:14296): WARNING **: The following assembly referenced from /home/yo/Descargas/SE32Linux/SubtitleEdit.exe could not be loaded:
     Assembly:   System.Windows.Forms    (assemblyref_index=1)
     Version:    2.0.0.0
     Public Key: b77a5c561934e089
The assembly was not found in the Global Assembly Cache, a path listed in the MONO_PATH environment variable, or in the location of the executing assembly (/home/yo/Descargas/SE32Linux/).

** (SubtitleEdit.exe:14296): WARNING **: Could not load file or assembly 'System.Windows.Forms, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' or one of its dependencies.

** (SubtitleEdit.exe:14296): WARNING **: Missing method EnableVisualStyles in assembly /home/yo/Descargas/SE32Linux/SubtitleEdit.exe, type System.Windows.Forms.Application

Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Windows.Forms, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' or one of its dependencies.
File name: 'System.Windows.Forms, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089'
yo@desktop:~/Descargas/SE32Linux$
Last edited by edea; 20th Oct 2011 at 15:23.
-
Hi Nikse!
I'm testing your great SE program thanks to edea's suggestion.
I'm running SE v3.2.2, build 25663, on XP SP3 with .NET 2.0, and everything works fine for me.
My old workflow was to use SupRip/SubRip and then OpenOffice for spell checking; now I can do both tasks together thanks to SE.
And the job is easier thanks to the new words/names being stored in static files.
But the best improvement is the rules in eng_OCRFixReplaceList.xml; I'm making my own spa_OCRFixReplaceList.xml.
I have a question.
Sometimes I get SRT files from other users that I need to spell check. When I load the file and click 'Spell Check', the rules in xxx_OCRFixReplaceList.xml are not applied.
Would it be possible to add this option, with something similar to xxx_OCRFixReplaceList.xml?
Thanks. -
-
Thanks.
It seems the language is recognized and, with Spanish subs, spa_OCRFixReplaceList.xml is used. OK.
BTW, most of my 'l' -> 'I' problems were solved by replacing
<WordPart from="l" to="i" />
with
<WordPart from="l" to="I" />
If the change falls inside a lowercase word (not at the beginning), a second pass solves the problem.
What is the difference between <PartialWordsAlways> and <PartialWords>?
The description is the same for both:
<!-- Will be used to check words not in dictionary -->
<!-- If new word(s) exists in spelling dictionary, it(they) is accepted -->
Now I'm working on another typical Spanish problem: 'i' -> "¡" (the opening exclamation mark).
In another text editor I use regular expressions (a lowercase i followed by a capital letter must be changed to '¡' followed by the same capital letter),
but I don't know which regular expression syntax (there are many) SE uses.
Thanks for your help. -
It looks like you're missing "System.Windows.Forms"
You might need a newer version of Mono - or perhaps this can help: http://ubuntuforums.org/showthread.php?t=851578 -
"PartialWordsAlways" is always replaced
"PartialWords" is only replaced if the new word is correctly spelled + longer than five characters.
(I'll update the comments - thx)
Hm, in Edit -> Multi replace you can try this reg ex: \b(?<test>i)[A-Z]
Also, try "Tools -> Fix common errors - Fix Spanish question and exclamation marks". -
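For anyone who wants to test the same rule outside SE, here is a rough Python equivalent of the 'i' -> '¡' fix. It is a sketch, not SE's implementation: Python's `re` syntax differs from .NET's (the named group from the pattern above is replaced by a lookahead), and the accented capitals in the character class are an addition not present in the thread's pattern. The names are made up.

```python
import re

# A lowercase "i" at the start of a word, directly followed by a capital letter.
# The lookahead keeps the capital letter itself out of the match, so only the
# "i" is replaced.
MISREAD_EXCLAMATION = re.compile(r"\bi(?=[A-ZÁÉÍÓÚÑ])")

def fix_inverted_exclamation(text):
    """Replace a word-initial 'i' before a capital letter with '¡'."""
    return MISREAD_EXCLAMATION.sub("¡", text)

print(fix_inverted_exclamation("- iSharon!"))
```

Note that this rule would also rewrite legitimate words that start with a lowercase i followed by a capital, such as brand names, so the result still needs a quick review.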