utf-8 – Row Coding

UTF-8 in Windows 7 CMD [duplicate]

November 28, 2023 by Tarik

This question has already been answered in "Unicode characters in Windows command line – how?". You missed one step: in addition to running chcp 65001 in the cmd console, you need to set the console font to Lucida Console.

How to config visual studio to use UTF-8 as the default encoding for all projects?

November 27, 2023 by Tarik

Visual Studio supports EditorConfig files (https://editorconfig.org/). Visual Studio (VS2017 and later) searches for a file named ‘.editorconfig’ in the directory containing your source files, or in any directory above it in the hierarchy. This file can be used to direct the editor to use UTF-8. I use the following:

[*]
end_of_line = lf
charset = utf-8
… Read more

Isn’t UTF-8’s byte order different on big-endian machines than on little-endian machines? So why doesn’t UTF-8 require a BOM?

November 27, 2023 by Tarik

The byte order differs between big-endian and little-endian machines only for words/integers larger than a byte. For example, on a big-endian machine a 2-byte short integer stores the 8 most significant bits in the first byte and the 8 least significant bits in the second byte. On a little-endian machine the 8 most … Read more
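To make the point concrete, here is a minimal sketch in Python (not part of the original answer). It packs a 16-bit integer in both byte orders, then encodes a short string as UTF-8 and as UTF-16. The sample value and text are arbitrary choices for illustration. UTF-8 is defined as a sequence of single bytes, so its output does not depend on machine endianness, while UTF-16 uses 2-byte code units and has two possible layouts.

```python
import struct

value = 0x1234  # arbitrary 16-bit integer used for illustration

# Multi-byte integers have two possible layouts in memory.
big = struct.pack(">H", value)      # big-endian: most significant byte first
little = struct.pack("<H", value)   # little-endian: least significant byte first
print(big.hex(), little.hex())      # 1234 3412

text = "\u00e9\u20ac"  # "é" followed by "€"

# UTF-8 is a stream of single bytes in a fixed order, so there is
# only one possible result regardless of the machine's endianness.
utf8 = text.encode("utf-8")
print(utf8.hex())                   # c3a9e282ac

# UTF-16 code units are 2 bytes wide, so the byte order matters,
# which is why a BOM is useful there.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")
print(utf16_be.hex(), utf16_le.hex())  # 00e920ac e900ac20
```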

Eclipse wrong Java properties UTF-8 encoding

November 26, 2023 by Tarik

Root cause: By default, Eclipse uses the ISO 8859-1 character encoding for properties files, so if the file contains any character outside ISO 8859-1 it will not be processed as expected. Solution 1: If you use Eclipse, you will notice that it implicitly converts such special characters into their \uXXXX equivalents. Try … Read more

What is the correct method for calculating the Content-length header in node.js

November 26, 2023 by Tarik

The Content-Length header must be the count of octets (bytes) in the response body. payload.length is the length of the string in UTF-16 code units, not bytes, and many characters take more than one byte in UTF-8. The better way is to use Buffer.byteLength(string, [encoding]) from http://nodejs.org/api/buffer.html. It’s a class method, so you do not have to create a Buffer instance.
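The same character-versus-byte distinction can be shown in any language. The sketch below uses Python rather than Node.js, purely as an illustration: `content_length` is a hypothetical helper name, and the sample payload is made up. It counts the octets of the encoded body, which is the quantity Content-Length needs.

```python
def content_length(body: str, encoding: str = "utf-8") -> int:
    """Return the number of octets the body occupies once encoded.

    Hypothetical helper for illustration; it encodes the text and
    counts the resulting bytes.
    """
    return len(body.encode(encoding))


# Made-up payload containing three non-ASCII characters.
payload = "na\u00efve caf\u00e9 \u2615"

print(len(payload))             # 12 code points
print(content_length(payload))  # 16 bytes in UTF-8
```

For pure ASCII text the two numbers agree, which is why the bug often goes unnoticed until the first non-ASCII character appears in a response.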

Is the u8 string literal necessary in C++11

November 26, 2023 by Tarik

The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one). The encoding of u8"Test String" is always UTF-8. The examples aren’t terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed … Read more

Dump in PyYaml as utf-8

November 25, 2023 by Tarik

Found the answer myself. I just had to dump it with the argument allow_unicode=True. Source: http://dpinte.wordpress.com/2008/10/31/pyaml-dump-option/

UTF-8 problems while reading CSV file with fgetcsv

November 24, 2023 by Tarik

Python 3 CSV file giving UnicodeDecodeError: ‘utf-8’ codec can’t decode byte error when I print

November 21, 2023 by Tarik

We know the file contains the byte b'\x96' since it is mentioned in the error message:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 7386: invalid start byte

Now we can write a little script to find out if there are any encodings where b'\x96' decodes to ñ:

import pkgutil
import encodings
import os
… Read more
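The script in the excerpt is cut off, so the following is only a minimal sketch of the idea, not the author's original code. It assumes that every module in the standard `encodings` package can be tried as a codec name, and the function name `codecs_decoding_to` is hypothetical. Modules that are not text codecs, or that cannot decode the byte, are skipped.

```python
import encodings
import pkgutil


def codecs_decoding_to(byte: bytes, target: str) -> list[str]:
    """List standard-library codec names that decode `byte` to `target`."""
    matches = []
    for module in pkgutil.iter_modules(encodings.__path__):
        name = module.name
        try:
            if byte.decode(name) == target:
                matches.append(name)
        except (UnicodeError, LookupError):
            # Not a text codec, unavailable on this platform,
            # or the byte is invalid in this encoding.
            continue
    return sorted(matches)


candidates = codecs_decoding_to(b"\x96", "\u00f1")  # "\u00f1" is ñ
print(candidates)
```

On a standard CPython installation the result includes `mac_roman`. Note that the same byte decodes to an en dash under `cp1252`, so the expected character matters when choosing between candidates.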

How to iterate over Unicode grapheme clusters in Rust?

November 15, 2023 by Tarik

You want to use the unicode-segmentation crate:

use unicode_segmentation::UnicodeSegmentation; // 1.5.0

fn main() {
    for g in "नमस्ते्".graphemes(true) {
        println!("- {}", g);
    }
}

(Playground, note: the playground editor can’t properly handle the string, so the cursor position is wrong in this one line)

This prints:

- न
- म
- स्
- ते्

The … Read more