Under review

Why does Textastic remove three bytes from an equivalent text file created in MS Notepad? (Don't remove byte order mark (BOM) on save if it exists)

Jeff MacKinnon 9 years ago in General updated by Bachsau 5 years ago 8
I have a large text file (consisting of HTML and JavaScript) which was created with Windows Notepad using Unicode UTF-8 encoding, and then uploaded to my Web server. Textastic downloads it (via FTP), resulting in a file on my iPad of the exact same size as the original. Textastic reports it correctly as having Unicode (UTF-8) Encoding, Windows (CRLF) Line Endings, and HTML Syntax Definition.

However, once I force Textastic to save the file (by modifying one character, for example), the resulting file is three bytes shorter than the original. Visual inspection of the before-and-after files (in Notepad) reveals no differences (same number of lines, with an apparent carriage return / line feed after the last line).

Why? What three bytes are being removed from the file?
It is probably Textastic compressing the file... I am no professional on this topic, but this seems the most likely reason.
I can only guess, but maybe Unicode characters are encoded differently than in the original file when Textastic saves them?

Also, Textastic changes all line breaks to be the same when it saves a file. Since in your case the result is CRLF (2 bytes) instead of LF or CR (1 byte), this could only result in a bigger, not a smaller file though.

To be sure, you would have to run the file through a diff program.
Thanks, Adrian and Alexander for your fast replies.

Alexander, the file is about 45KB in size, with about 1,440 lines, so the loss of three bytes is a bit strange. However, I had already decided to run the before-and-after files through the Windows command prompt File Compare (FC) program, and will do so as soon as I get a chance. I'll reply here with the results once I've done that. Thanks again.
Alexander, I finally got around to analyzing this.

I ran the Windows command prompt File Compare (FC) program, followed by a file dump program. The "missing" three bytes turn out to be the Unicode byte order mark (BOM), encoded in UTF-8 as 0xEF, 0xBB, 0xBF, which Microsoft Notepad inserts at the very beginning of each UTF-8 text file. Apparently, the Textastic text editor strips them when it saves the file. Should it be doing this? I want to be able to modify this (and other) files using both Notepad and Textastic interchangeably without problems.
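
For anyone who wants to check a file without a dump program, here is a minimal Python sketch of the same test. The helper name `has_utf8_bom` and the sample bytes are made up for illustration; only `codecs.BOM_UTF8` comes from the Python standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# Hypothetical sample standing in for a Notepad-saved file with CRLF line endings.
original = codecs.BOM_UTF8 + "<html>\r\n</html>\r\n".encode("utf-8")

# What the file looks like once the BOM has been stripped on save.
stripped = original[len(codecs.BOM_UTF8):]

print(has_utf8_bom(original))          # True
print(has_utf8_bom(stripped))          # False
print(len(original) - len(stripped))   # 3
```

To test a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.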

UPDATE: I'm including an excerpt (below) from the Wikipedia article on "byte order mark." It states that, although the BOM is optional, the Unicode standard does NOT recommend removing it, if it is present.

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard permits the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5] [6] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[7]

Currently, Textastic never writes a BOM when saving UTF-8 files, so if one exists, it is removed when you change the file contents.

I'll consider changing this behavior.
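
The behavior requested in this thread (keep the BOM on save if the file had one when it was opened) could look roughly like the Python sketch below. This is not Textastic's code; the function names `load_text` and `save_text` are hypothetical, and the sketch assumes the editor remembers a single "had BOM" flag per file.

```python
import codecs


def load_text(data: bytes) -> tuple[str, bool]:
    """Decode UTF-8 bytes and remember whether they began with a BOM."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" drops a leading BOM if present, so it never shows up in the text.
    text = data.decode("utf-8-sig")
    return text, had_bom


def save_text(text: str, had_bom: bool) -> bytes:
    """Encode text as UTF-8, writing the BOM back only if the file had one."""
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if had_bom else body


# Hypothetical round trip: open a Notepad-style file, change one character, save.
original = codecs.BOM_UTF8 + "<p>Hello</p>\r\n".encode("utf-8")
text, had_bom = load_text(original)
edited = text.replace("H", "J")
saved = save_text(edited, had_bom)

print(len(saved) == len(original))       # True
print(saved.startswith(codecs.BOM_UTF8)) # True
```

With this approach a file that arrives without a BOM is also saved without one, so neither kind of file changes size unexpectedly.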
Yes, please do. Thanks, Alexander!

I ran into the same issue. Well, 4 years have passed since this question was asked. Is there any chance UTF-8 with BOM will be supported?

Same here. It definitely should not remove any BOM without even asking the user.