I am trying to extract subtitles from two Taiwanese series, "我們與惡的距離: The World Between Us" and "想見你: Someday or One Day" (official translations, not literal). As with most Mandarin series, the subs are hardcoded, and I am looking for a way to extract them into an .srt file to get workable text. More generally, it would be great to be able to retrieve the subs from any Mandarin series or movie, since these almost always have hardcoded subs.
The purpose is to do some vocabulary analysis in R once the text is extracted, not translation. It actually seems easier to find the English subs in some cases.
Someday or One Day can be found here: http://www.mp4ba.cc/gangtaiju/598.html (BT下載 is the BitTorrent download).
Thank you for your help!
PS: I don't know much about subtitle extraction, but I've looked at VideoProc, My MP4Box GUI and MediaInfo.
There's nothing to extract. Hardcoded subs are images which are now part of the video.
You have to use an OCR (Optical Character Recognition) program to convert the subs to text. Here are some threads that discuss the process: https://www.videohelp.com/search?q=ocr+chinse+subtitles&siteurl=&Search=Search
Note that even the best software requires a lot of tweaking during and after the process. This is especially critical for Hanzi where a single stroke or two can completely change the meaning of the word/character.
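To give an idea of what the non-OCR half of such a workflow could look like, here is a minimal Python sketch. It assumes you have already sampled the video at a fixed interval and run some OCR engine on each frame (that part is not shown and depends on the tool you pick), so that you have a list of (timestamp, text) pairs. The function names `format_timestamp`, `frames_to_cues` and `cues_to_srt` are made up for this illustration. It only merges exact repeats, so OCR noise between frames would still need the manual tweaking mentioned above.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def frames_to_cues(frames: List[Tuple[float, str]], step: float) -> List[Cue]:
    """Merge runs of consecutive frames with identical text into cues.

    frames: (timestamp, ocr_text) pairs sorted by time; "" means no subtitle.
    step:   assumed sampling interval; a cue ends one step after its last frame.
    """
    cues: List[Cue] = []
    current_text = ""
    start = 0.0
    last = 0.0
    for timestamp, text in frames:
        text = text.strip()
        if text != current_text:
            # The subtitle changed: close the running cue, if there is one.
            if current_text:
                cues.append((start, last + step, current_text))
            current_text = text
            start = timestamp
        last = timestamp
    if current_text:
        cues.append((start, last + step, current_text))
    return cues


def cues_to_srt(cues: List[Cue]) -> str:
    """Render cues as SRT text: index, time range, subtitle, blank line."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical OCR output for frames sampled every 0.5 s.
    sample = [(0.0, ""), (0.5, "你好"), (1.0, "你好"), (1.5, ""), (2.0, "謝謝")]
    print(cues_to_srt(frames_to_cues(sample, step=0.5)))
```

The resulting .srt is plain text, so it can be read into R or anything else for analysis afterwards.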
So basically there has been no change since 2014, more or less, is that right?
How about OCR with Subtitle Edit? Or even SPlayer? Has anyone tried the subtitle generator in SPlayer?
It's very frustrating because the subs are right there in the video, clean and legible, but there is no way to get them as a separate file. Also, it is very easy to find these series for free streaming or download, but it doesn't seem possible to buy them and get the subs that way.
In general, Google OCR is much better with Chinese than it used to be, even better than Pleco in my opinion, but if there is no way to automate the process of getting the subs, it's not worth it.
Anyway, if anybody has other suggestions, please let me know!