python-unicode - Row Coding

Python 3: os.walk() file paths UnicodeEncodeError: 'utf-8' codec can't encode: surrogates not allowed

On Linux, filenames are ‘just a bunch of bytes’, and are not necessarily encoded in a particular encoding. Python 3 tries to turn everything into Unicode strings. In doing so the developers came up with a scheme to translate byte strings to Unicode strings and back without loss, and without knowing the original encoding. They …
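The lossless scheme described here is exposed in Python 3 as the surrogateescape error handler (PEP 383). A minimal sketch of the round trip follows; the byte string raw_name is a made-up example filename, not one from the original question.

```python
# Hypothetical filename bytes: Latin-1 encoded, so not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Undecodable bytes are mapped to lone surrogates instead of raising.
name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")

# A strict encode is what fails with "surrogates not allowed".
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

When such a name has to be printed or logged, encoding it with errors="surrogateescape" (or errors="replace" if the exact bytes do not matter) avoids the exception.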

UnicodeDecodeError: ('utf-8' codec) while reading a csv file [duplicate]

by Tarik

Known encoding: If you know the encoding of the file you want to read in, you can use pd.read_csv('filename.txt', encoding='encoding'). These are the possible encodings: https://docs.python.org/3/library/codecs.html#standard-encodings Unknown encoding: If you do not know the encoding, you can try to use chardet; however, this is not guaranteed to work. It is more of a guess. import …
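As a rough illustration of the "unknown encoding" case, the sketch below does not use chardet or pandas. It swaps in a simpler technique: trying a short list of candidate encodings with the standard library and keeping the first one that decodes. The function name read_csv_rows, the candidate list, and the sample data are all assumptions made for this example, and a successful decode does not prove the guess is right.

```python
import csv
import io

def read_csv_rows(data, encodings=("utf-8", "cp1252")):
    """Decode raw CSV bytes with the first candidate encoding that works."""
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        rows = list(csv.reader(io.StringIO(text, newline="")))
        return encoding, rows
    raise ValueError("none of the candidate encodings could decode the data")

# Hypothetical sample: CSV bytes written as cp1252, which are not valid UTF-8.
sample = "name,city\nJânis,Paris\n".encode("cp1252")
used_encoding, rows = read_csv_rows(sample)
```

Putting the strictest encoding (UTF-8) first matters, because permissive single-byte encodings accept almost any input.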

Correctly reading text from Windows-1252(cp1252) file in python

by Tarik

CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:
>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
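The same point can be checked in Python 3. This is a small sketch, with variable names chosen for illustration: byte 0xE2 decodes to â under cp1252, while ā (U+0101) has no cp1252 byte at all.

```python
# Byte 0xE2 is â (U+00E2) in cp1252.
decoded = b"J\xe2nis".decode("cp1252")

# ā (U+0101) is not part of cp1252, so encoding it fails.
try:
    "J\u0101nis".encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
```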

Removing unicode \u2026 like characters in a string in python2.7 [duplicate]

by Tarik

Python 2.x
>>> s
'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
>>> print(s.decode('unicode_escape').encode('ascii','ignore'))
This is some  text that has to be cleaned! it's annoying!
Python 3.x
>>> s="This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!"
>>> s.encode('ascii', 'ignore')
b"This is some  text that has to be …
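In Python 3, a str has no decode() method, so the Python 2 approach for literal backslash escapes needs an extra step. The following is a sketch under the assumption that the input contains only ASCII characters plus escaped sequences such as \u2026; the helper name strip_escaped_non_ascii is invented for this example.

```python
def strip_escaped_non_ascii(s):
    """Resolve literal \\uXXXX escapes, then drop every non-ASCII character."""
    # str -> bytes, so the unicode_escape codec can interpret the escapes.
    resolved = s.encode("ascii").decode("unicode_escape")
    # Drop anything outside ASCII and return a str again.
    return resolved.encode("ascii", "ignore").decode("ascii")

# Raw string: the backslashes are literal characters, as in the Python 2 case.
s = r"This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!"
cleaned = strip_escaped_non_ascii(s)
```

Note that \u0027 survives because it is an ASCII apostrophe, and that removing π leaves two consecutive spaces behind.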

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

by Tarik

In my case (macOS), there was a .DS_Store file in my data folder, a hidden, auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
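Instead of deleting the file by hand, a loader can skip hidden files when it walks the data folder. This is only a sketch: the helper visible_files and the temporary folder used to demonstrate it are assumptions for this example, not part of the original answer.

```python
import os
import tempfile

def visible_files(folder):
    """Return sorted paths of regular files whose names do not start with a dot."""
    return sorted(
        entry.path
        for entry in os.scandir(folder)
        if entry.is_file() and not entry.name.startswith(".")
    )

# Demonstration with a throwaway folder holding one hidden and one regular file.
with tempfile.TemporaryDirectory() as folder:
    for name in (".DS_Store", "data.csv"):
        with open(os.path.join(folder, name), "wb") as fh:
            fh.write(b"\x80")  # 0x80 is not a valid UTF-8 start byte
    found = [os.path.basename(path) for path in visible_files(folder)]
```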

Python string argument without an encoding

by Tarik

You are passing in a string object to a bytearray():
bytearray(content[current_pos:(final_pos)])
You'll need to supply an encoding argument (second argument) so that it can be encoded to bytes. For example, you could encode it to UTF-8:
bytearray(content[current_pos:(final_pos)], 'utf8')
From the bytearray() documentation: The optional source parameter can be used to initialize the array in a …
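A self-contained version of both calls, as a minimal sketch; the sample text and the slice bounds are placeholders standing in for content, current_pos and final_pos from the question.

```python
content = "héllo wörld"  # placeholder text

# Without an encoding, Python 3 refuses to build a bytearray from a str.
try:
    bytearray(content[0:5])
    raised = False
except TypeError:
    raised = True

# With an encoding, the slice is encoded to bytes first.
chunk = bytearray(content[0:5], "utf8")
```

The resulting bytearray is longer than the slice (6 bytes for 5 characters) because é takes two bytes in UTF-8.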

Python string to unicode [duplicate]

by Tarik

In Python 2, Unicode escapes only work in unicode strings, so this
a="\u2026"
is actually a string of 6 characters: '\', 'u', '2', '0', '2', '6'. To make unicode out of this, use decode('unicode-escape'):
a="\u2026"
print repr(a)
print repr(a.decode('unicode-escape'))
## '\\u2026'
## u'\u2026'

“SyntaxError: Non-ASCII character …” or “SyntaxError: Non-UTF-8 code starting with …” trying to use non-ASCII text in a Python script

by Tarik

I'd recommend reading the PEP that the error message points you to. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more …
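For illustration, a file using the declaration might look like the sketch below. The variable price is a made-up example. The declaration (defined in PEP 263) has to sit on the first or second line of the file, and the file itself must really be saved as UTF-8. Python 3 already assumes UTF-8 source by default, so the line mainly matters for Python 2.

```python
# -*- coding: utf-8 -*-
# The pound sign below is a non-ASCII character in the source file.
price = "£5"
```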

Why does ENcoding a string result in a DEcoding error (UnicodeDecodeError)?

by Tarik

"你好".encode('utf-8') In Python 2, encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So Python has to convert the string to a unicode object first. So it does the equivalent of "你好".decode().encode('utf-8'). But the decode fails because the string isn't valid ASCII. …
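The hidden decode step can be reproduced explicitly in Python 3, where text and bytes are separate types and nothing is converted implicitly. This is a sketch with illustrative variable names.

```python
text = "你好"

# str -> bytes works directly; no implicit decode is involved in Python 3.
encoded = text.encode("utf-8")

# This mirrors what Python 2 did behind the scenes: decoding the raw bytes
# as ASCII, which fails on the first byte above 0x7F.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```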

Python – 'ascii' codec can't decode byte

by Tarik