Unicode displaying incorrectly
Software Engineering Team Lead and Director of Cloudsure
Problem
We made content changes on an MVC view in a .NET application. The Unicode characters were displaying fine until the change was made which resulted in gibberish on the screen where the Unicode characters used to be.
Investigation
-
Using Git we looked at the
diff
to see if we could find anything strange that was introduced to break the display. We didn't see anything out of the ordinary. -
We confirm the
<meta charset="utf-8" />
tag was present in the shared layout and that thecontent-type
of the response wastext/html; charset=utf-8
. -
We replaced the Unicode character with its HTML entity and it worked. This is more explicit and doesn't rely on any behind-the-scenes black magic but unsustainable with the way we receive the content unless it was automated.
-
We took to Google and started searching. We came across a StackOverflow question where there was mention of the BOM and something struck me.
Cause of failure
Hunch
We experimented with a Visual Studio add-on called Strip'em to automatically save line-endings in LF as Visual Studio doesn't have a global setting for this.
In retrospect this wasn't a great idea but we were hoping to avoid seeing:
-
the wall of pink in our Git diffs and be able to diff without having to ignore white spaces.
-
nasty dialogs every time we open a file that doesn't have consistent line-endings especially when working with files that were created on other operating systems.
Tip: In Visual Studio 2015 you can change the encoding and line ending for
an individual file at File > Advanced Save Options
.
This add-on had an unexpected "feature" that was not advertised in the dialog. When it removed the line-endings, the BOM at the start of the file was also removed.
There's two problems with it [Strip-em], firstly it kills the utf-8 magic bytes which windows likes, and also causes a change after file save so VS asks to reload changes, I know that the latter can be avoided but sometimes you don't want to reload changes automatically. ~ Brett Ryan commented on StackOverflow
Experiment
As we couldn't find any reason for the Unicode characters to be displaying differently now, we believed that it must be the add-on. We needed to test it.
-
We disabled the line-ending conversion in the add-on but had to add the BOM back.
-
We opened the file in Notepad++ (before we knew we could save it in Visual Studio)
Encoding > Encode in UTF-8-BOM
, saved the file, viewed it and it displayed correctly. -
We changed the encoding back to
UTF-8
without the BOM and saw that it broke. -
This confirmed that the BOM character allowed the hardcoded Unicode characters to display correctly on Windows but we had to prove that the add-on was the culprit. (We hadn't seen the StackOverflow comment at that time)
-
We opened the BOM file in Visual Studio and saved it. It worked.
You can also see the BOM in a Hex editor.
Solution
Saving the UTF-8 file with the BOM signature on Windows solves the problem.
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present, or the file contains only ASCII bytes. ~ Wikipedia
Note: As I was so curious about the byte order mark, I investigated and documented my findings.
References
- Stop Visual Studio from mixing line endings in files
- Notepad++ - Free source code editor which supports several programming languages running under the Microsoft Windows environment.
- HxD - Freeware Hex Editor and Disk Editor.