General question about subtitle file encoding for foreign characters

2nd Jan 2015 20:47 #1
Holofernes

Member

Join Date
Jan 2015
Hello guys, I'm new here!

I've got a general question about the encoding of subtitle files. I often use Japanese subtitles and had to adapt my VLC player to be able to display them in the first place. But one general problem remains: in many subtitle files I find online, the text is not displayed correctly even when I open the file in WordPad, and subsequently not in VLC player either.

Case in point: these subtitles for X-Men: First Class

The sample page indicates that the subtitles are indeed in Japanese, but once I download the file and open it, I get what you see in the attached image. I noticed that the file seems to be encoded as ANSI, so I thought saving it as UTF-8 might restore the Japanese characters, but without success.

This is a problem I've had lots of times with subtitle files found on the web, so I wanted to ask you guys with more experience what's wrong here.

2nd Jan 2015 21:31 #2
El Heggunte

DECEASED

Join Date
Jun 2009

Location
Heaven
That's not "ANSI", that's Shift_JIS. In order to (correctly) convert and save the file to UTF-8 or UTF-16, you must use a text editor that translates the Shift_JIS double-bytes to their respective Unicode codepoints. EditPlus and EmEditor are two well-known decent text editors. You can give JWPce a try as well.
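To illustrate why re-saving the file as UTF-8 did not bring the Japanese text back, here is a minimal Python sketch. The sample text is made up, and using cp1252 as the "ANSI" code page is an assumption (the actual ANSI code page depends on the Windows locale). It only shows the general mechanism, not what any particular editor does internally.

```python
# Hypothetical sample text; any Japanese string behaves the same way.
original = "こんにちは"

# The bytes as they would be stored in a Shift_JIS subtitle file.
raw = original.encode("shift_jis")

# A Western Windows editor assumes its "ANSI" code page (cp1252 assumed here).
# errors="replace" is used because a few byte values are undefined in cp1252.
misread = raw.decode("cp1252", errors="replace")

# Saving the misread text as UTF-8 faithfully preserves the garbage,
# because the wrong characters were already chosen at load time.
resaved = misread.encode("utf-8")

# Decoding the original bytes with the right codec recovers the text.
recovered = raw.decode("shift_jis")

print(repr(misread))
print(recovered == original)
```

The point is that the encoding has to be chosen correctly when the bytes are read; the encoding chosen when saving cannot repair a wrong guess made earlier.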

Attached Files

X-Men.First.Class.2011-Unicode.srt (140.0 KB, 206 views)
Last edited by El Heggunte; 2nd Jan 2015 at 21:41. Reason: clarity
2nd Jan 2015 21:49 #3
Holofernes

Member

Join Date
Jan 2015
Interesting! Thanks a bunch for the explanation & already attaching the converted file!

I tried to reproduce the process with EditPlus, but failed. Which software did you use? Was it JWPce as in the screenshot?

Also, as none of the players I used was able to read the subtitle files in Shift_JIS encoding, do you happen to know why some files are being uploaded in that format?

Last edited by Holofernes; 2nd Jan 2015 at 21:50. Reason: One sentence missing.

2nd Jan 2015 22:30 #4
El Heggunte

DECEASED

Join Date
Jun 2009

Location
Heaven
JWPce was used just for generating the screenshot.
The conversion was done with EditPlus:

1) open the original file
2) change the screen font to a Japanese monospaced font (e.g., MS Gothic)
3) reload the document as..., then select the appropriate encoding
4) save as a Unicode file.
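For anyone who prefers a script to an editor, the same steps can be sketched in Python. This is only a rough equivalent under stated assumptions: the function name `convert_subtitle` and its parameters are hypothetical, and `cp932` (the Windows variant of Shift_JIS) is a guess at the source encoding that has to be adjusted if the file uses something else.

```python
from pathlib import Path


def convert_subtitle(src, dst, source_encoding="cp932", target_encoding="utf-8"):
    """Re-encode a subtitle file (hypothetical helper, not part of any tool).

    source_encoding is an assumption about the input file; a wrong guess
    raises UnicodeDecodeError instead of silently producing garbage.
    """
    # Steps 1 and 3: read the raw bytes and interpret them with the right encoding.
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding)

    # Step 4: write the same characters back out as Unicode.
    # Bytes are written directly so that line endings stay untouched.
    Path(dst).write_bytes(text.encode(target_encoding))
    return text
```

A call would look like `convert_subtitle("movie.srt", "movie-utf8.srt")`, where both file names are placeholders. Passing `target_encoding="utf-8-sig"` writes a BOM at the start of the file, which some Windows programs use to recognise UTF-8.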

Originally Posted by Holofernes

do you happen to know why some files are being uploaded in that format?

Because most people, regardless of their respective native tongues, solemnly ignore the purpose and usefulness of Unicode.
And the expression "most people" surely (and primarily) includes the designers of operating systems and of hardware, software developers, webmasters and web designers, the whole "IT folks", so to speak -.-

Below is the mess that we've got because of the backward-compatibility with the obsolete and narrow-minded thing named 'ASCII':

http://en.wikipedia.org/wiki/Binary-to-text_encoding#Encoding_standards

http://en.wikipedia.org/wiki/UTF-8

Code:

Putting it simply, computer systems available in 2013 are squarely based on the limitations of 1975 hardware using paradigms and heuristics developed in 1956.
3rd Jan 2015 03:14 #5
Nikse

Member

Join Date
Jul 2011

Location
Denmark
You can also use Subtitle Edit: "File" -> "Import subtitle with manual chosen encoding...". Try it.
It will suggest an encoding and show a preview.

I fully agree with "El Heggunte" that Unicode should be used today. Unicode files often have a BOM header which identifies them as e.g. UTF-8, so these files can be opened correctly all over the world. Non-Unicode files such as "ANSI" ones rely on the current computer's settings, which is really bad.
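As a rough illustration of how a BOM identifies a file, here is a small Python sketch. The helper name `sniff_bom` is hypothetical and this is not how Subtitle Edit or any player actually detects encodings. Note that many UTF-8 files carry no BOM at all, in which case the function returns `None` and the encoding still has to be guessed or chosen by hand.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None.

    The returned codec names consume the BOM while decoding.
    """
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name
    return None
```

A Shift_JIS file has no such marker, which is why a program has to fall back on the system code page or on a manual choice.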

I like this article about text encoding.

3rd Jan 2015 10:09 #6
El Heggunte

DECEASED

Join Date
Jun 2009

Location
Heaven
Originally Posted by Nikse

http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/

Thanks for the URL.

I must say that I totally disagree with the "UTF-8 To The Rescue" point-of-view...

Originally Posted by all-about-unicode-utf8-character-sets

Best of all it is backward compatible with ASCII. Unlike some of the other proposed solutions, any document written only in ASCII, using only characters 0-127, is perfectly valid UTF-8 as well – which saves bandwidth and hassle.

ASCII must die, period.
Why "save bandwidth" and "storage space" in the age of on-the-fly compression, broadband connections and terabyte-sized HDDs?
Oh, I see, easy money speaks louder.
And why have I still not seen a single page of C or C++ source code written as a UTF-16 file?
Oh, I see, laziness and PSEUDO-"productivity" speak louder too -.-
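For reference, the size trade-off being argued about can be measured directly. The following Python sketch uses made-up sample strings and a hypothetical helper `encoded_sizes`. It only counts bytes and takes no side in the argument.

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "shift_jis")):
    """Return {encoding: byte count} for the given text (hypothetical helper)."""
    return {enc: len(text.encode(enc)) for enc in encodings}


ascii_line = "Hello, subtitles!"   # hypothetical ASCII-only sample
japanese_line = "こんにちは"        # hypothetical Japanese sample

# ASCII-only text yields exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_line.encode("ascii") == ascii_line.encode("utf-8")

print(same_bytes)
print(encoded_sizes(ascii_line))
print(encoded_sizes(japanese_line))
```

ASCII-only text doubles in size as UTF-16, while kana and kanji take three bytes each in UTF-8 but two in UTF-16 or Shift_JIS.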
