Opened 10 years ago
Closed 9 years ago
#2617 closed defect (fixed)
wxgui Raster query redirect to console UnicodeDecodeError
Reported by: | marisn | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.0.3 |
Component: | wxGUI | Version: | svn-trunk |
Keywords: | query, encoding, python, gettext | Cc: | |
CPU: | Unspecified | Platform: | MSWindows Vista |
Description
Steps to reproduce:
- use raster query tool to query raster
- check "redirect to console"
Vaicājuma rezultāti:
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 65, in <lambda>
    self.redirect.Bind(wx.EVT_CHECKBOX, lambda evt: self._onRedirect(evt.IsChecked()))
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 143, in _onRedirect
    self.redirectOutput.emit(output=self._textToRedirect())
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 148, in _textToRedirect
    text = printResults(self._model, self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 215, in printResults
    return '\n'.join(textList)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 4: ordinal not in range(128)
Also reported as a part of #2120; 7.0.0 is also affected.
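The failure mode in the traceback can be approximated outside GRASS. The following is a minimal Python 3 sketch, not the wxGUI code: an explicit ASCII decode stands in for the implicit str-to-unicode conversion that Python 2 performs when byte strings and unicode values are joined. The names `title`, `raw` and `failure` are made up for the illustration.

```python
# Python 3 sketch: an explicit ASCII decode stands in for the implicit
# conversion Python 2 performed when joining str and unicode values.
title = "Vaicājuma rezultāti"   # translated dialog text from the report
raw = title.encode("utf-8")     # what a byte-string gettext would hand back

try:
    raw.decode("ascii")
    failure = None
except UnicodeDecodeError as err:
    failure = err

# The error object records the offending byte offset and codec.
print(failure)
```

The first non-ASCII byte of the UTF-8 form of this string is 0xc4 (the lead byte of "ā"), which is the byte named in the reported error.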
Change History (16)
comment:1 by , 10 years ago
Keywords: | query encoding added |
---|
follow-up: 3 comment:2 by , 10 years ago
No, this is not a solution - it is still broken.
I appended a following print in query.py L185:
print 'k: %s (%s) v: %s (%s)' % (k, type(k), v, type(v))
And here is output:
k: east, north (<type 'unicode'>) v: 622578.672986, 6399325.43444 (<type 'str'>)
k: dores_idw@kalistrats (<type 'unicode'>) v: {'nosaukums': '', 'kr\xc4\x81sa': '255:202:000', 'v\xc4\x93rt\xc4\xabba': '71.6390742964988'} (<type 'dict'>)
k: nosaukums (<type 'str'>) v: (<type 'str'>)
k: krāsa (<type 'str'>) v: 255:202:000 (<type 'str'>)
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1230, in MouseActions
    self.OnLeftUp(event)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1407, in OnLeftUp
    self.mapQueried.emit(x=self.mouse['end'][0], y=self.mouse['end'][1])
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\signal.py", line 229, in emit
    dispatcher.send(signal=self, *args, **kwargs)
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\dispatcher.py", line 349, in send
    **named
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\robustapply.py", line 60, in robustApply
    return receiver(*arguments, **named)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 868, in Query
    self.QueryMap(east, north, qdist, rast, vect)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 922, in QueryMap
    self.dialogs['query'] = QueryDialog(parent = self, data = result)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 46, in __init__
    self._model = QueryTreeBuilder(self.data, column=self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 201, in QueryTreeBuilder
    addNode(parent=model.root, data=part, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 190, in addNode
    addNode(parent=node, data=v, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 187, in addNode
    k = DecodeString(k)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\core\gcmd.py", line 76, in DecodeString
    return string.decode(_enc)
  File "C:\Program Files\GRASS GIS 7.1.svn\Python27\lib\encodings\cp1257.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3: character maps to <undefined>
As the output shows, v on the second line contains UTF-8 encoded text. Lines 3 and 4 report the keys to be str, and thus DecodeString is called. So far nothing is wrong, but DecodeString uses GetSystemEncoding to decode the string. On this system the _enc variable is set to cp1257, which is definitely not UTF-8, and thus decoding fails. The string in question (krāsa) comes from the Latvian translation of GRASS. To reproduce the issue on your system, you must translate "color" to a word with non-ASCII letters in it (zbarvenã) and, of course, encode the translation file (PO) as UTF-8.
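The mismatch described above can be checked in isolation. This is a small Python 3 sketch using the byte string from the debug output; it does not call DecodeString or GetSystemEncoding, and the variable names are illustrative only.

```python
raw = b"kr\xc4\x81sa"   # UTF-8 bytes of "krāsa", as printed in the debug output

# Decoding with the encoding the bytes were actually written in works.
as_utf8 = raw.decode("utf-8")

# Decoding with the system codepage from the report does not:
# byte 0x81 has no mapping in cp1257.
try:
    raw.decode("cp1257")
    cp1257_error = None
except UnicodeDecodeError as err:
    cp1257_error = err

print(as_utf8)
print(cp1257_error)
```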
The source of problem is r47310 where instead of installing unicode version of gettext a bytestring version is installed. This should work fine, but now in every place where a _() call is made, it returns str for unicode translations. Reverting r47310 fixes this bug (and probably others too!) without any problems, still I would like to hear Glynn's rationale why it was necessary in the first place (preferably with patches that solve _() issue if r47310 is to stay). Not using unicode version of gettext is really strange, as Slovenian is the only language NOT using UTF-8 in their PO files and it has seen the last update in 2005, thus GRASS PO files ARE unicode-ready.
comment:3 by , 10 years ago
Replying to marisn:
The source of problem is r47310 where instead of installing unicode version of gettext a bytestring version is installed. This should work fine, but now in every place where a _() call is made, it returns str for unicode translations. Reverting r47310 fixes this bug (and probably others too!) without any problems, still I would like to hear Glynn's rationale why it was necessary in the first place (preferably with patches that solve _() issue if r47310 is to stay).
The scripting library only uses byte strings, never unicode. Values returned from _() are typically written to streams (stdout/stderr or files) or used as command-line arguments. These contexts invariably require byte strings, so if _() returned a unicode value it would just get converted to a byte string using the default encoding (not the locale's encoding or filesystem encoding etc), which is usually ASCII. So prior to r47310, any attempt by a script to use a translated string while in a non-English locale was likely to result in the familiar "codec can't encode character ..." error.
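The encode-side failure described here can be sketched in Python 3, where the conversion has to be spelled out. This is only an illustration under assumptions: `message` and `stream` are invented names, an in-memory `io.BytesIO` stands in for stdout or a file, and `errors="replace"` is a choice made for the sketch rather than anything the scripting library does.

```python
import io
import locale

message = "krāsa"   # a translated string held as text (unicode)

# What the implicit Python 2 conversion amounted to: encode with ASCII.
try:
    message.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as err:
    ascii_error = err

# Encoding explicitly with the locale's preferred encoding instead.
# errors="replace" keeps the sketch from failing in a plain C locale.
encoding = locale.getpreferredencoding(False)
stream = io.BytesIO()
stream.write(message.encode(encoding, errors="replace"))

print(ascii_error)
print(encoding, stream.getvalue())
```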
If there's a bug here, it's wxGUI expecting the grass.script library to cater to it. grass.script doesn't exist for the benefit of wxGUI. If grass.script isn't suitable for wxGUI (e.g. because of wxPython's use of unicode), wxGUI should provide its own alternatives, not break grass.script.
But the real question is: where is that UTF-8 coming from? On Windows, nothing should ever see UTF-8, as Windows doesn't support UTF-8 as an actual codepage (cp65001 is a pseudo-codepage which exists to allow certain functions to use UTF-8; but you can't have a locale which uses cp65001 as its codepage).
Byte strings which end up in wxGUI should be interpreted as using the locale's codepage (cp1257 in this case), as should anything converted from unicode to a byte string by wxGUI. Anything coming from wxPython (e.g. the contents of a text field) should be unicode values (UTF-16-LE internally).
Not using unicode version of gettext is really strange, as Slovenian is the only language NOT using UTF-8 in their PO files and it has seen the last update in 2005, thus GRASS PO files ARE unicode-ready.
The encoding used in PO files doesn't matter on systems which use GNU gettext, which will automatically convert from the encoding used in the PO file to the locale's encoding (so a single PO file can be used for both e.g. en_GB.utf8 and en_GB.iso88591). In fact, the encoding used in PO files shouldn't even be visible to applications (unless they're trying to read the PO file directly rather than using gettext, which would be dumb).
Ideally, PO files should use the locale's legacy encoding (e.g ISO-8859-1 for most of Western Europe). Newer systems will translate that to UTF-8 if that's what the locale uses; older systems will just copy the data verbatim, so it needs to use the locale's encoding (which, on older systems, won't be UTF-8). This has the added advantage of restricting what goes into those files to characters which can actually be displayed.
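The conversion attributed to GNU gettext above happens inside the C library. The following Python 3 sketch only mimics its effect on one string, assuming a UTF-8 catalog and the cp1257 locale from the report; `po_bytes` and `locale_bytes` are illustrative names.

```python
po_bytes = b"kr\xc4\x81sa"     # msgstr as stored in a UTF-8 catalog
locale_encoding = "cp1257"     # assumed locale codepage, taken from the report

# Re-encode from the catalog's charset to the locale's charset.
locale_bytes = po_bytes.decode("utf-8").encode(locale_encoding)

# One byte per character in the single-byte codepage.
print(len(po_bytes), len(locale_bytes))
```

With this conversion in place, a consumer that decodes with the locale's codepage gets the intended text back.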
follow-up: 5 comment:4 by , 10 years ago
Just dropping a note here as it needs further investigation: https://docs.python.org/2/library/gettext.html#gettext-vs-lgettext
In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding(). Also notice that Python 2.4 introduces new functions to explicitly choose the codeset used in translated strings. If a codeset is explicitly set, even lgettext() will return translated strings in the requested codeset, as would be expected in the GNU gettext implementation.
Note on "same codeset" explains where the UTF-8 strings are coming from and why it differs from C implementation of gettext.
The same document also contains another interesting remark: https://docs.python.org/2/library/gettext.html#the-gnutranslations-class
Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.
follow-ups: 6 7 9 comment:5 by , 10 years ago
Replying to marisn:
In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().
Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.
Try r64834.
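The binding mentioned above can be inspected directly. This is a Python 3 sketch, so it differs from the Python 2 situation in the ticket: the module is called `builtins`, the bound object is reached through `__self__` rather than `im_self`, and `lgettext()` is no longer available in recent Python 3 releases. The domain name and the empty temporary locale directory are placeholders, so the installed object is a fallback catalog.

```python
import builtins
import gettext
import tempfile

localedir = tempfile.mkdtemp()   # empty directory: no catalog will be found
gettext.install("example_domain", localedir)

# gettext.install() stores a bound method in builtins under the name "_".
installed = builtins.__dict__["_"]

print(installed.__name__)                  # which method "_" is bound to
print(type(installed.__self__).__name__)   # the catalog object behind it
```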
Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.
"Recommended" by someone who isn't going to be doing the (substantial) amount of work involved in adding all the required .encode() calls, or dealing with the bugs which arise whenever someone forgets the .encode() call. Because without those calls, unicode values will be converted using implicit conversions, which fails whenever the unicode value contains non-ASCII characters.
As a rough guide, you can (and should) ignore anything the Python developers have to say about Unicode. Their attitude tends to be "everything should use Unicode, and the fact that POSIX (and a lot else) doesn't is your problem and not ours".
follow-up: 8 comment:6 by , 10 years ago
Replying to glynn:
Replying to marisn:
Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.
"Recommended" by someone who isn't going to be doing the (substantial) amount of work involved in adding all the required .encode() calls, or dealing with the bugs which arise whenever someone forgets the .encode() call. Because without those calls, unicode values will be converted using implicit conversions, which fails whenever the unicode value contains non-ASCII characters.
We have to do this work in any case for Python 3. We can create a function that explicitly converts every input to unicode, something like:
import sys

PY2 = sys.version[0] == '2'


def to_text_string(obj, encoding=None):
    """Convert `obj` to (unicode) text string"""
    if PY2:
        # Python 2
        if encoding is None:
            return unicode(obj)
        else:
            return unicode(obj, encoding)
    else:
        # Python 3
        if encoding is None:
            return str(obj)
        elif isinstance(obj, str):
            # In case this function is not used properly, this could happen
            return obj
        else:
            return str(obj, encoding)
As a rough guide, you can (and should) ignore anything the Python developers have to say about Unicode. Their attitude tends to be "everything should use Unicode, and the fact that POSIX (and a lot else) doesn't is your problem and not ours".
Many recent computer languages (e.g. Go, Rust) consider this a good practice... and personally I agree with them. Python 3 removes this implicit conversion, and this is the reason why I believe we should move to Python 3.
comment:7 by , 10 years ago
Keywords: | python gettext added |
---|
Replying to glynn:
Replying to marisn:
In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().
Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.
Try r64834:
import os
import gettext
gettext.install('grasslibs', os.path.join(os.getenv("GISBASE"), 'locale'))
import __builtin__
# rebind _() from .gettext() to .lgettext() on the installed translation object
__builtin__.__dict__['_'] = __builtin__.__dict__['_'].im_self.lgettext
This solves the problem, but the fix is yet another reason for me to believe that the translation function should be explicitly imported and that changing builtins, explicitly or in a hidden way, should be avoided. Compare the code above with the code in GUI (r57219 and r57220):
# gui/wxpython/core/utils.py
# _ intended to be used also outside this module
try:
    # intended to be used also outside this module
    import gettext
    _ = gettext.translation('grasswxpy', os.path.join(os.getenv("GISBASE"), 'locale')).ugettext
except IOError:
    # using no translation silently
    def null_gettext(string):
        return string
    _ = null_gettext
Please see the further discussion in #2425.
comment:8 by , 10 years ago
Replying to zarch:
We have to do this work in any case for python3.
If we actually use it. For most scripting tasks, Python 3 offers nothing but inconvenience.
And even then, there's a much simpler way to deal with it: convert unicode strings to byte strings at the point they arise (there are far fewer of these compared to the number of places where we will need to write byte strings to streams or pass them as command arguments).
We can create a function that explicitly converts every input to unicode, something like:
But why bother? At the lowest level, scripts tend to do two things: invoke commands and read/write streams. Both of these deal with byte strings. Converting to unicode then back again just creates unnecessary failure modes; there's no guarantee that data read from a given stream will be in the locale's encoding, or even in any known encoding.
wxGUI has to deal with this because wxPython uses Unicode throughout (and look how many wxGUI issues relate to Unicode{Encode,Decode}Error as a result). The scripting library doesn't need to deal with this; there's no inherent reason why most scripts should ever encounter a unicode value.
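The point about stream data having no guaranteed encoding can be illustrated with a Python 3 sketch. The byte string below is invented for the example and mixes two encodings. The `surrogateescape` error handler shown at the end is a Python 3 facility, not something proposed in this discussion. It is included only to show one way such bytes can survive a decode and encode round trip.

```python
raw = b"caf\xe9 \xc4\x81"   # Latin-1 "é" followed by UTF-8 "ā": no single valid encoding

# A strict decode fails on data like this.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as err:
    strict_error = err

# surrogateescape smuggles undecodable bytes through as lone surrogates,
# so encoding with the same handler restores the original bytes.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_error)
print(round_tripped == raw)
```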
comment:9 by , 9 years ago
Replying to glynn:
Replying to marisn:
In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().
Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.
Try r64834.
Is this a backport candidate?
follow-up: 14 comment:13 by , 9 years ago
I'm still getting this error both in trunk and release70. My locale:
> locale
LANG=fr_BE
LANGUAGE=fr_BE
LC_CTYPE=fr_BE.UTF-8
LC_NUMERIC=C
LC_TIME=fr_BE.UTF-8
LC_COLLATE=fr_BE.UTF-8
LC_MONETARY=fr_BE.UTF-8
LC_MESSAGES=fr_BE.UTF-8
LC_PAPER=fr_BE.UTF-8
LC_NAME=fr_BE.UTF-8
LC_ADDRESS=fr_BE.UTF-8
LC_TELEPHONE=fr_BE.UTF-8
LC_MEASUREMENT=fr_BE.UTF-8
LC_IDENTIFICATION=fr_BE.UTF-8
LC_ALL=
I only get an error with a raster map with labels (e.g. landclass96), presumably because 'label' is translated to 'étiquette' in French.
Here's the entire backtrace:
Traceback (most recent call last):
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapwin/buffered.py", line 1234, in MouseActions
    self.OnLeftUp(event)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapwin/buffered.py", line 1411, in OnLeftUp
    self.mapQueried.emit(x=self.mouse['end'][0], y=self.mouse['end'][1])
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/signal.py", line 229, in emit
    dispatcher.send(signal=self, *args, **kwargs)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/dispatcher.py", line 349, in send
    **named
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/robustapply.py", line 60, in robustApply
    return receiver(*arguments, **named)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapdisp/frame.py", line 868, in Query
    self.QueryMap(east, north, qdist, rast, vect)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapdisp/frame.py", line 920, in QueryMap
    self.dialogs['query'].SetData(result)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 87, in SetData
    self.redirectOutput.emit(output=self._textToRedirect())
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 148, in _textToRedirect
    text = printResults(self._model, self._colNames[1])
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 215, in printResults
    return '\n'.join(textList)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
Please try r64818 in trunk. Any chance it would solve #2601?