Hello guys, I'm new here!
I've got a general question about the encoding of subtitle files. I often use Japanese subtitles and had to adapt my VLC player to be able to display them in the first place. But one general problem remains: in many subtitle files I find online, the text is not displayed correctly, neither when I open the file in WordPad nor, subsequently, in VLC player.
Case in point: these subtitles for X-Men: First Class
The sample page indicates that the subtitles are indeed in Japanese, but once I download the file and open it, I get what you see in the attached image. I noticed that the file seems to be encoded as ANSI, so I thought saving it as UTF-8 might restore the Japanese characters, but without success.
This is a problem I've had lots of times with subtitle files found on the web, so I wanted to ask you guys with more experience what's wrong here.
-
That's not "ANSI", that's Shift_JIS. In order to (correctly) convert and save the file to UTF-8 or UTF-16, you must use a text editor that translates the Shift_JIS double-bytes to their respective Unicode codepoints.
EditPlus and EmEditor are two well-known decent text editors. You can give JWPce a try as well.
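To illustrate why re-saving the misread file as UTF-8 cannot bring the Japanese text back, here is a minimal Python sketch. The sample string and the choice of cp1252 as the "ANSI" code page are assumptions for illustration only; the actual code page depends on the Windows locale.

```python
# Sample Japanese text; any Japanese string would do (illustrative only).
original = "こんにちは"

# The bytes as they would sit in a Shift_JIS-encoded subtitle file.
raw = original.encode("shift_jis")

# Misreading those bytes as a Western "ANSI" code page (cp1252 assumed here)
# produces mojibake. errors="replace" covers bytes cp1252 leaves undefined.
misread = raw.decode("cp1252", errors="replace")

# Saving the misread text as UTF-8 only preserves the mojibake.
resaved = misread.encode("utf-8").decode("utf-8")

# Decoding with the correct source encoding recovers the text.
recovered = raw.decode("shift_jis")

print(repr(misread))
print(recovered)
```

The point is that the conversion has to start from the raw bytes and the right source encoding, not from text that was already decoded with the wrong one.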
Last edited by El Heggunte; 2nd Jan 2015 at 21:41. Reason: clarity
-
Interesting! Thanks a bunch for the explanation & already attaching the converted file!
I tried to reproduce the process with EditPlus, but failed. Which software did you use? Was it JWPce as in the screenshot?
Also, as none of the players I used was able to read the subtitle files in Shift_JIS encoding, do you happen to know why some files are being uploaded in that format?
Last edited by Holofernes; 2nd Jan 2015 at 21:50. Reason: One sentence missing.
-
JWPce was used just for generating the screenshot.
The conversion was done with EditPlus:
1) open the original file
2) change the screen font to a Japanese monospaced font (e.g., MS Gothic)
3) reload the document as..., then select the appropriate encoding
4) save as a Unicode file.
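For anyone who prefers a script over a text editor, the same "reload as Shift_JIS, save as Unicode" steps can be sketched in Python. This is only a rough equivalent, not what EditPlus does internally; the function name `convert_to_utf8` and its parameters are made up for this example.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="shift_jis"):
    """Decode src using source_encoding and write the text to dst as UTF-8.

    Raises UnicodeDecodeError if the file is not valid in source_encoding.
    """
    # Work on raw bytes so that line endings are left untouched.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

A call such as `convert_to_utf8("subs.srt", "subs.utf8.srt")` (hypothetical file names) would write the converted copy next to the original. If the file actually uses Microsoft's Shift_JIS variant, passing `source_encoding="cp932"` may be needed.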
do you happen to know why some files are being uploaded in that format?
And the expression "most people" surely (and primarily) includes the designers of operating systems and of hardware, software developers, webmasters and webdesigners, the whole "IT folks", so to speak -.-
Below is the mess that we've got because of the backward-compatibility with the obsolete and narrow-minded thing named 'ASCII':
http://en.wikipedia.org/wiki/Binary-to-text_encoding#Encoding_standards
http://en.wikipedia.org/wiki/UTF-8
Putting it simply, computer systems available in 2013 are squarely based on the limitations of 1975 hardware using paradigms and heuristics developed in 1956.
-
You can also use Subtitle Edit - "File" -> "Import subtitle with manual chosen encoding..." - try it
It will suggest an encoding + show a preview:
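The idea of trying encodings and looking at a preview can be sketched in a few lines of Python. This is not how Subtitle Edit works internally, just a plain trial-decoding approach; the candidate list and the name `preview_decodings` are assumptions for this example.

```python
# Candidate encodings to try, in order (an arbitrary, illustrative list).
CANDIDATES = ["utf-8", "shift_jis", "euc_jp", "cp1252"]


def preview_decodings(raw, candidates=CANDIDATES, length=40):
    """Return (encoding, preview) pairs for each candidate that decodes raw cleanly."""
    results = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; skip it.
            continue
        results.append((enc, text[:length]))
    return results
```

Several candidates may decode without error (single-byte code pages accept almost anything), so a human still has to look at the previews and pick the one that reads as Japanese.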
I fully agree with "El Heggunte" that Unicode should be used today - Unicode files normally have a BOM header which identifies them as e.g. UTF-8, so these files can be opened correctly all over the world. Non-Unicode files such as ANSI ones rely on the current computer's settings, which is really bad.
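As a small illustration of how a BOM identifies a Unicode file, here is a Python sketch that checks the first bytes against the known BOM values from the standard `codecs` module. The helper name `encoding_from_bom` is made up for this example, and a file without a BOM simply yields `None` (it could still be UTF-8, or something else entirely).

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(raw):
    """Return a Python codec name if raw starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the BOM when decoding, so `raw.decode(encoding_from_bom(raw))` gives the text without a stray marker character at the start.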
I like this article about text encoding.
-
Thanks for the URL.
I must say that I totally disagree with the "UTF-8 To The Rescue" point-of-view...
Originally Posted by all-about-unicode-utf8-character-sets
Why "save bandwidth" and "storage space" in the age of on-the-fly compression, broadband connections and terabyte-sized HDDs?
Oh, I see, easy money speaks louder.
And why have I still not seen a single page of C or C++ source code written as a UTF-16 file?
Oh, I see, laziness and PSEUDO-"productivity" speak louder too -.-