How to config visual studio to use UTF-8 as the default encoding for all projects?

Visual Studio supports EditorConfig files (https://editorconfig.org/) Visual Studio (VS2017 and later) searches for a file named ‘.editorconfig’ in the directory containing your source files, or anywhere above this directory in the hierarchy. This file can be used to direct the editor to use utf-8. I use the following: [*] end_of_line = lf charset = utf-8 …

Read more

Isn’t on big endian machines UTF-8’s byte order different than on little endian machines? So why then doesn’t UTF-8 require a BOM?

The byte order is different on big endian vs little endian machines for words/integers larger than a byte. e.g. on a big-endian machine a short integer of 2 bytes stores the 8 most significant bits in the first byte, the 8 least significant bits in the second byte. On a little-endian machine the 8 most …

Read more

What is the correct method for calculating the Content-length header in node.js

Content-Length header must be the count of octets in the response body. payload.length refers to string length in characters. Some UTF-8 characters contain multiple bytes. The better way would be to use Buffer.byteLength(string, [encoding]) from http://nodejs.org/api/buffer.html. It’s a class method so you do not have to create any Buffer instance.

Is the u8 string literal necessary in C++11

The encoding of “Test String” is the implementation-defined system encoding (the narrow, possibly multibyte one). The encoding of u8″Test String” is always UTF-8. The examples aren’t terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed …

Read more

Python 3 CSV file giving UnicodeDecodeError: ‘utf-8’ codec can’t decode byte error when I print

We know the file contains the byte b’\x96′ since it is mentioned in the error message: UnicodeDecodeError: ‘utf-8′ codec can’t decode byte 0x96 in position 7386: invalid start byte Now we can write a little script to find out if there are any encodings where b’\x96’ decodes to ñ: import pkgutil import encodings import os …

Read more

How to iterate over Unicode grapheme clusters in Rust?

You want to use the unicode-segmentation crate: use unicode_segmentation::UnicodeSegmentation; // 1.5.0 fn main() { for g in “नमस्ते्”.graphemes(true) { println!(“- {}”, g); } } (Playground, note: the playground editor can’t properly handle the string, so the cursor position is wrong in this one line) This prints: – न – म – स् – ते् The …

Read more