(sorry for the cross-post in subtitles, just noticed that this is the forum for new tools)
Hi,
I'm developing a tool to extract closed caption data from MPEG files. At this point it's very usable (yet I still have a long to-do list), so I'd like to share it with everyone who needs subtitles for any reason. A summary of the current status:
- The .zip file includes all source code, plus linux and windows binaries.
- It correctly extracts closed captions from DVDs and most transport streams (.ts). So you can record a TV show with your digital (ATSC) capture card and expect to get a transcript that is correct in both text and timing.
- It "supports" other input streams such as ReplayTV, but this is theoretical: I've followed the specs, but I don't have any samples to actually test with.
- Current version generates a .bin file you need to process with McPoodle's tools. They are free so this shouldn't be a problem. I'm now working on direct .srt generation but I'm having a few issues with XDS data getting in the way and I want to fix this before making this new version public.
- It's more or less fast, or at least faster than the alternatives.
You can get it here:
http://sourceforge.net/projects/ccextractor/
If you check the stats you'll see that this project is becoming (slowly) popular.
I'm actively developing it, so I need people willing to try it, and help me fix the bugs (sending samples that it can't process correctly is the best thing you can do to help except fixing it yourself).
Most important things on the to-do list:
- Complete .srt generation. Current status: Produces basic .srt files for CC1 data, still some junk in CC2 data caused by XDS bytes. "basic .srt" means the text is there but font type (italics and underlined) is missing.
- EIA-708 extraction. Still a long time before the 608 shutdown deadline, though.
- Fix bugs.
- Possibly extract to better formats than .srt.
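For anyone unfamiliar with the target format, here is a minimal sketch of what a "basic .srt" cue looks like when written out. It is not taken from the tool's source; the function names `srt_timestamp` and `srt_cue` are made up for illustration.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Build one cue: sequence number, time range, then the caption text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


# Cues are separated by a blank line when written to the file.
print(srt_cue(1, 0.0, 2.0, "Hello"))
```

Font styling such as italics would need extra markup inside the text line, which is the part still missing from the current output.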
I am interested in testing this out. I am using an OnAir GT USB tuner as the source for a SageTV HTPC setup. The MPEG-2 files I get when recording HD DO contain the CC information, but I can only render it using the program bundled with the OnAir, not the SageTV app. Sage supports rendering .smi files, so if I can get the data out of the file (via a batch job, etc.), I can get SageTV to render it. I think your application might help me with this, but on my first try I got no results.
If you think we can work together and I can help you with testing, let me know.
Thanks.
Eckwell -
I tested the previous version. It functioned okay, but it missed quite a few captions. I wanted to give you the data to help resolve this, but I didn't have the time.
I'll give this version a try.
ICBM target coordinates:
26° 14' 10.16"N -- 80° 16' 0.91"W -
I just uploaded 0.20 to sourceforge.
It can now produce .smi files directly from the captures.
Also, it supports Hauppauge PVR-250 captures - and most likely a bunch of other cards as well since many store closed captions in the same format.
BTW if you want to contact me for help or to work with me on fixing bugs, etc, please email me - I don't check the forums often. -
I tested v0.20 and noticed some issues.
First, it is still missing quite a few captions. I'll try to gather the specific GOP header packs with the user information where the typical error occurs.
Second, I noticed that the program reports "Suspected False Picture Headers". The quantity that it reported on my test was over 1500. Let me tell you that there is no such thing as a "false header" in the spec. When you parse the byte sequence "00 00 01 00", you know that you are at the start of a new picture.
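As a rough illustration of that scan, here is a minimal sketch that finds every picture start code in a buffer. It assumes `data` is an already demultiplexed MPEG-2 video elementary stream; in a program or transport stream the same bytes can be split across packets. The function name `find_picture_starts` is made up for this example.

```python
PICTURE_START_CODE = b"\x00\x00\x01\x00"


def find_picture_starts(data: bytes) -> list[int]:
    """Return the byte offset of every picture start code found in data."""
    offsets = []
    pos = data.find(PICTURE_START_CODE)
    while pos != -1:
        offsets.append(pos)
        # Resume the search one byte past the previous hit.
        pos = data.find(PICTURE_START_CODE, pos + 1)
    return offsets


sample = b"\xff" + PICTURE_START_CODE + b"\x12\x34" + PICTURE_START_CODE
print(find_picture_starts(sample))
```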
I also noticed that you are timing by using your acquired "user data". This is definitely NOT the way to time, since user data may NOT be continuous. Using my test sample, the program reports a total length of 38 minutes. The actual length is 41 minutes. A better way would be to count the PICTURE HEADERS plus the number of REPEAT FIRST FIELD flags in the PICTURE CODING EXTENSION HEADERS, then divide by 29.97. Since the user data may not be continuous, you need to pad the recovered binary file with "80 80" for every picture and every RFF flag that does NOT have any information in order to provide the proper time.
Another way to obtain the timing is from the time stamp for the first frame included with the GOP HEADER information (bytes 4, 5, 6 and 7 of the header). When the user data is not continuous, this may be the best way to re-establish sync once user data resumes.
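A minimal sketch of the padding and counting rule described above, taken at face value from this post rather than from the MPEG-2 spec. The names `pad_caption_pairs` and `estimated_duration_seconds` are made up, and `slots` is assumed to hold one entry per picture or RFF flag: either a 2-byte caption pair or None when no user data was found.

```python
NULL_PAIR = b"\x80\x80"      # the "80 80" filler pair mentioned above
FRAME_RATE = 30000 / 1001    # NTSC rate, roughly 29.97 frames per second


def pad_caption_pairs(slots):
    """Join caption pairs, writing 80 80 wherever a slot has no data."""
    return b"".join(pair if pair is not None else NULL_PAIR for pair in slots)


def estimated_duration_seconds(picture_headers: int, rff_flags: int) -> float:
    """Duration estimate as stated in the post: (pictures + RFF flags) / 29.97."""
    return (picture_headers + rff_flags) / FRAME_RATE


padded = pad_caption_pairs([b"\x94\x20", None, b"\xc8\xe5"])
print(padded.hex(" "))
```

The point of the padding is that the output file keeps one pair per slot, so a position in the file maps directly to a time.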
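A minimal sketch of reading that time stamp, assuming `header` starts at the GOP start code (00 00 01 B8) so that bytes 4 to 7 hold the time code. The bit layout used is the standard MPEG-2 one: drop-frame flag (1 bit), hours (5), minutes (6), marker (1), seconds (6), pictures (6). The function name `parse_gop_time_code` is made up for this example.

```python
GOP_START_CODE = b"\x00\x00\x01\xb8"


def parse_gop_time_code(header: bytes):
    """Return (drop_frame, hours, minutes, seconds, pictures) from a GOP header."""
    if len(header) < 8 or header[:4] != GOP_START_CODE:
        raise ValueError("not a GOP header")
    bits = int.from_bytes(header[4:8], "big")
    drop_frame = bool((bits >> 31) & 0x01)
    hours = (bits >> 26) & 0x1F
    minutes = (bits >> 20) & 0x3F
    # Bit 19 is a marker bit and carries no time information.
    seconds = (bits >> 13) & 0x3F
    pictures = (bits >> 7) & 0x3F
    return drop_frame, hours, minutes, seconds, pictures


# Build a test header for 01:02:03, picture 4, marker bit set.
value = (1 << 26) | (2 << 20) | (1 << 19) | (3 << 13) | (4 << 7)
print(parse_gop_time_code(GOP_START_CODE + value.to_bytes(4, "big")))
```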
I'll try to gather more debug information for you.
Originally Posted by SLK001
The padding itself is based on the PTS (Presentation Time Stamp) field.
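For reference, a minimal sketch of decoding a PTS and turning a gap between two PTS values into a frame count. This is not the tool's actual code: `decode_pts` and `frames_in_gap` are made-up names, the input is assumed to be the 5-byte PTS field from a PES header, and the frame count assumes 29.97 fps video (3003 ticks of the 90 kHz clock per frame).

```python
PTS_CLOCK_HZ = 90_000       # PTS values count ticks of a 90 kHz clock
TICKS_PER_FRAME = 3003      # 90000 * 1001 / 30000, i.e. one frame at 29.97 fps


def decode_pts(field: bytes) -> int:
    """Decode the 33-bit PTS spread over 5 bytes, skipping the marker bits."""
    if len(field) != 5:
        raise ValueError("PTS field must be 5 bytes")
    return (
        ((field[0] >> 1) & 0x07) << 30   # PTS bits 32..30
        | field[1] << 22                 # PTS bits 29..22
        | (field[2] >> 1) << 15          # PTS bits 21..15
        | field[3] << 7                  # PTS bits 14..7
        | field[4] >> 1                  # PTS bits 6..0
    )


def frames_in_gap(prev_pts: int, next_pts: int) -> int:
    """Number of 29.97 fps frames between two PTS values (hypothetical helper)."""
    return round((next_pts - prev_pts) / TICKS_PER_FRAME)


pts = decode_pts(bytes([0x21, 0x00, 0x05, 0xBF, 0x21]))
print(pts, pts / PTS_CLOCK_HZ)
```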
Sweet, looks good. I may test it later. Also, should this tool be added to the tool list?