Opened 5 years ago

Closed 4 years ago

#2617 closed defect (fixed)

wxgui Raster query redirect to console UnicodeDecodeError

Reported by: marisn
Owned by: grass-dev@…
Priority: normal
Milestone: 7.0.3
Component: wxGUI
Version: svn-trunk
Keywords: query, encoding, python, gettext
Cc:
CPU: Unspecified
Platform: MSWindows Vista

Description

Steps to reproduce:

  • use raster query tool to query raster
  • check "redirect to console"
Vaicājuma rezultāti:
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 65, in <lambda>
    self.redirect.Bind(wx.EVT_CHECKBOX, lambda evt: self._onRedirect(evt.IsChecked()))
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 143, in _onRedirect
    self.redirectOutput.emit(output=self._textToRedirect())
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 148, in _textToRedirect
    text = printResults(self._model, self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 215, in printResults
    return '\n'.join(textList)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 4: ordinal not in range(128)

Also reported as a part of #2120; 7.0.0 is also affected.
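The failure in the last frame can be illustrated outside GRASS. Under Python 2, joining a list that mixes unicode and byte strings makes the interpreter decode each byte string with the ASCII codec. The sketch below assumes Python 3, where that conversion no longer happens implicitly, so it performs the decoding step by hand. join_like_python2 is a hypothetical helper written for this illustration, not code from query.py.

```python
def join_like_python2(separator, parts):
    """Mimic Python 2's str/unicode mixing: byte strings are decoded as ASCII."""
    decoded = []
    for part in parts:
        if isinstance(part, bytes):
            # Python 2 performed this step implicitly, always with 'ascii'.
            part = part.decode('ascii')
        decoded.append(part)
    return separator.join(decoded)


# A unicode item next to a UTF-8 encoded byte string, as in the query results.
text_list = ['east, north', 'krāsa'.encode('utf-8')]

try:
    join_like_python2('\n', text_list)
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc.reason)  # the ASCII codec's complaint about the non-ASCII byte
```

As long as every byte string is pure ASCII the join succeeds, which is probably why the problem only shows up with translated labels.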

Change History (16)

comment:1 Changed 5 years ago by annakrat

Keywords: query encoding added

Please try r64818 in trunk. Any chance it would solve #2601?

comment:2 Changed 5 years ago by marisn

No, this is not a solution - it is still broken.

I added the following print statement at line 185 of query.py:

print 'k: %s (%s) v: %s (%s)' % (k, type(k), v, type(v))

And here is output:

k: east, north (<type 'unicode'>) v: 622578.672986, 6399325.43444 (<type 'str'>)
k: dores_idw@kalistrats (<type 'unicode'>) v: {'nosaukums': '', 'kr\xc4\x81sa': '255:202:000', 'v\xc4\x93rt\xc4\xabba': '71.6390742964988'} (<type 'dict'>)
k: nosaukums (<type 'str'>) v:  (<type 'str'>)
k: krāsa (<type 'str'>) v: 255:202:000 (<type 'str'>)
Traceback (most recent call last):
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1230, in MouseActions
    self.OnLeftUp(event)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapwin\buffered.py", line 1407, in OnLeftUp
    self.mapQueried.emit(x=self.mouse['end'][0], y=self.mouse['end'][1])
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\signal.py", line 229, in emit
    dispatcher.send(signal=self, *args, **kwargs)
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\dispatcher.py", line 349, in send
    **named
  File "C:\Program Files\GRASS GIS 7.1.svn\etc\python\grass\pydispatch\robustapply.py", line 60, in robustApply
    return receiver(*arguments, **named)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 868, in Query
    self.QueryMap(east, north, qdist, rast, vect)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\mapdisp\frame.py", line 922, in QueryMap
    self.dialogs['query'] = QueryDialog(parent = self, data = result)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 46, in __init__
    self._model = QueryTreeBuilder(self.data, column=self._colNames[1])
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 201, in QueryTreeBuilder
    addNode(parent=model.root, data=part, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 190, in addNode
    addNode(parent=node, data=v, model=model)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\gui_core\query.py", line 187, in addNode
    k = DecodeString(k)
  File "C:\Program Files\GRASS GIS 7.1.svn\gui\wxpython\core\gcmd.py", line 76, in DecodeString
    return string.decode(_enc)
  File "C:\Program Files\GRASS GIS 7.1.svn\Python27\lib\encodings\cp1257.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3: character maps to <undefined>

As the output shows, on the second line v contains UTF-8 encoded text. Lines 3 and 4 report the key to be a str, so DecodeString is called. So far nothing is wrong, but DecodeString uses GetSystemEncoding to decode the string. On this system the _enc variable is set to cp1257, which is definitely not UTF-8, so decoding fails. The string in question (krāsa) comes from the Latvian translation of GRASS. To reproduce the issue on your system, you must translate "color" to a word with non-ASCII letters in it (zbarvenã) and, of course, encode the translation file (PO) as UTF-8.
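The decoding failure can be reproduced in isolation with the exact bytes shown in the debug output. The sketch below assumes Python 3. decode_with_fallback is a hypothetical helper that tries UTF-8 before the locale encoding; it is one possible workaround, not the fix that was eventually committed.

```python
key = b'kr\xc4\x81sa'  # UTF-8 bytes of the translated key, as printed above

# Decoding UTF-8 data with the system code page fails on byte 0x81,
# which cp1257 leaves undefined.
try:
    key.decode('cp1257')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc


def decode_with_fallback(data, fallback_encoding):
    """Try UTF-8 first, then the given fallback encoding (hypothetical helper)."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # 'replace' keeps the GUI alive even if the guess is wrong.
        return data.decode(fallback_encoding, 'replace')


decoded_key = decode_with_fallback(key, 'cp1257')
```

Trying UTF-8 first is reasonably safe because text in a legacy single-byte encoding that contains non-ASCII characters is rarely valid UTF-8 by accident.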

The source of the problem is r47310, where a bytestring version of gettext is installed instead of the unicode version. This should work fine, but now every _() call returns str for unicode translations. Reverting r47310 fixes this bug (and probably others too!) without any problems. Still, I would like to hear Glynn's rationale for why it was necessary in the first place (preferably with patches that solve the _() issue if r47310 is to stay). Not using the unicode version of gettext is really strange: Slovenian is the only language NOT using UTF-8 in its PO files, and it was last updated in 2005, so GRASS PO files ARE unicode-ready.

comment:3 in reply to:  2 Changed 5 years ago by glynn

Replying to marisn:

The source of problem is r47310 where instead of installing unicode version of gettext a bytestring version is installed. This should work fine, but now in every place where a _() call is made, it returns str for unicode translations. Reverting r47310 fixes this bug (and probably others too!) without any problems, still I would like to hear Glynn's rationale why it was necessary in the first place (preferably with patches that solve _() issue if r47310 is to stay).

The scripting library only uses byte strings, never unicode. Values returned from _() are typically written to streams (stdout/stderr or files) or used as command-line arguments. These contexts invariably require byte strings, so if _() returned a unicode value it would just get converted to a byte string using the default encoding (not the locale's encoding or filesystem encoding etc), which is usually ASCII. So prior to r47310, any attempt by a script to use a translated string while in a non-English locale was likely to result in the familiar "codec can't encode character ..." error.
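A minimal sketch of the encode error described here, assuming Python 3 and simulating Python 2's implicit conversion by hand. write_like_python2 is a hypothetical helper, and cp1257 is assumed as the locale encoding, matching the reporter's system.

```python
import io


def write_like_python2(stream, value, default_encoding='ascii'):
    """Write to a byte stream, encoding text the way Python 2 did implicitly."""
    if isinstance(value, str):
        # Python 2 used the default encoding here, not the locale's encoding.
        value = value.encode(default_encoding)
    stream.write(value)


message = 'krāsa'  # a translated (unicode) string
out = io.BytesIO()

# Writing the unicode value directly triggers the implicit ASCII conversion.
try:
    write_like_python2(out, message)
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Encoding explicitly with the locale's encoding avoids the error.
write_like_python2(out, message.encode('cp1257'))
```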

If there's a bug here, it's wxGUI expecting the grass.script library to cater to it. grass.script doesn't exist for the benefit of wxGUI. If grass.script isn't suitable for wxGUI (e.g. because of wxPython's use of unicode), wxGUI should provide its own alternatives, not break grass.script.

But the real question is: where is that UTF-8 coming from? On Windows, nothing should ever see UTF-8, as Windows doesn't support UTF-8 as an actual codepage (cp65001 is a pseudo-codepage which exists to allow certain functions to use UTF-8; but you can't have a locale which uses cp65001 as its codepage).

Byte strings which end up in wxGUI should be interpreted as using the locale's codepage (cp1257 in this case), as should anything converted from unicode to a byte string by wxGUI. Anything coming from wxPython (e.g. the contents of a text field) should be unicode values (UTF-16-LE internally).

Not using unicode version of gettext is really strange, as Slovenian is the only language NOT using UTF-8 in their PO files and it has seen the last update in 2005, thus GRASS PO files ARE unicode-ready.

The encoding used in PO files doesn't matter on systems which use GNU gettext, which will automatically convert from the encoding used in the PO file to the locale's encoding (so a single PO file can be used for both e.g. en_GB.utf8 and en_GB.iso88591). In fact, the encoding used in PO files shouldn't even be visible to applications (unless they're trying to read the PO file directly rather than using gettext, which would be dumb).

Ideally, PO files should use the locale's legacy encoding (e.g ISO-8859-1 for most of Western Europe). Newer systems will translate that to UTF-8 if that's what the locale uses; older systems will just copy the data verbatim, so it needs to use the locale's encoding (which, on older systems, won't be UTF-8). This has the added advantage of restricting what goes into those files to characters which can actually be displayed.

comment:4 Changed 5 years ago by marisn

Just dropping a note here as it needs further investigation: https://docs.python.org/2/library/gettext.html#gettext-vs-lgettext

In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding(). Also notice that Python 2.4 introduces new functions to explicitly choose the codeset used in translated strings. If a codeset is explicitly set, even lgettext() will return translated strings in the requested codeset, as would be expected in the GNU gettext implementation.

The note on "same codeset" explains where the UTF-8 strings are coming from and why the behaviour differs from the C implementation of gettext.
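The difference between the two method families can be sketched without a real message catalog. FakeCatalog below is a hypothetical stand-in for a Python 2 translations object, written so that it runs under Python 3. The catalog charset (UTF-8) and the preferred encoding (cp1257) are assumptions taken from the reporter's setup.

```python
class FakeCatalog:
    """Hypothetical stand-in for a Python 2 translations object."""

    def __init__(self, messages, charset, preferred_encoding):
        self._messages = messages              # msgid -> unicode translation
        self._charset = charset                # charset declared in the PO file
        self._preferred = preferred_encoding   # stands in for locale.getpreferredencoding()

    def ugettext(self, msgid):
        # Unicode result, independent of any encoding.
        return self._messages.get(msgid, msgid)

    def gettext(self, msgid):
        # Bytes in the catalog's own charset, regardless of the locale.
        return self.ugettext(msgid).encode(self._charset)

    def lgettext(self, msgid):
        # Bytes in the preferred system encoding.
        return self.ugettext(msgid).encode(self._preferred)


catalog = FakeCatalog({'color': 'krāsa'}, charset='utf-8',
                      preferred_encoding='cp1257')
utf8_bytes = catalog.gettext('color')     # what wxGUI received
locale_bytes = catalog.lgettext('color')  # what DecodeString expected
```

Only the second result can be decoded with the system encoding, which matches the failure seen in DecodeString.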

The same document contains another interesting remark: https://docs.python.org/2/library/gettext.html#the-gnutranslations-class

Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.

comment:5 in reply to:  4 ; Changed 5 years ago by glynn

Replying to marisn:

In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().

Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.

Try r64834.

Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.

"Recommended" by someone who isn't going to be doing the (substantial) amount of work involved in adding all the required .encode() calls, or dealing with the bugs which arise whenever someone forgets the .encode() call. Because without those calls, unicode values will be converted using implicit conversions, which fails whenever the unicode value contains non-ASCII characters.

As a rough guide, you can (and should) ignore anything the Python developers have to say about Unicode. Their attitude tends to be "everything should use Unicode, and the fact that POSIX (and a lot else) doesn't is your problem and not ours".

comment:6 in reply to:  5 ; Changed 5 years ago by zarch

Replying to glynn:

Replying to marisn:

Note that the Unicode version of the methods (i.e. ugettext() and ungettext()) are the recommended interface to use for internationalized Python programs.

"Recommended" by someone who isn't going to be doing the (substantial) amount of work involved in adding all the required .encode() calls, or dealing with the bugs which arise whenever someone forgets the .encode() call. Because without those calls, unicode values will be converted using implicit conversions, which fails whenever the unicode value contains non-ASCII characters.

We have to do this work in any case for Python 3. We can create a function that explicitly converts every input to unicode, something like:

import sys

PY2 = sys.version_info[0] == 2

def to_text_string(obj, encoding=None):
    """Convert `obj` to (unicode) text string"""
    if PY2:
        # Python 2
        if encoding is None:
            return unicode(obj)
        else:
            return unicode(obj, encoding)
    else:
        # Python 3
        if encoding is None:
            return str(obj)
        elif isinstance(obj, str):
            # In case this function is not used properly, this could happen
            return obj
        else:
            return str(obj, encoding)

As a rough guide, you can (and should) ignore anything the Python developers have to say about Unicode. Their attitude tends to be "everything should use Unicode, and the fact that POSIX (and a lot else) doesn't is your problem and not ours".

Many recent programming languages (e.g. Go, Rust) consider this good practice, and personally I agree with them. Python 3 removed this implicit conversion, which is why I believe we should move to Python 3.

comment:7 in reply to:  5 Changed 5 years ago by wenzeslaus

Keywords: python gettext added

Replying to glynn:

Replying to marisn:

In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().

Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.

Try r64834:

import os
import gettext
gettext.install('grasslibs', os.path.join(os.getenv("GISBASE"), 'locale'))
import __builtin__
# rebind the builtin _() to lgettext, which returns strings in the locale's encoding
__builtin__.__dict__['_'] = __builtin__.__dict__['_'].im_self.lgettext

This solves the problem, but the fix is yet another reason for me to believe that the translation function should be explicitly imported and that changing builtins, explicitly or in a hidden way, should be avoided. Compare the code above with the code in the GUI (r57219 and r57220):

# gui/wxpython/core/utils.py
import os

try:
    # _ is intended to be used also outside this module
    import gettext
    _ = gettext.translation('grasswxpy', os.path.join(os.getenv("GISBASE"), 'locale')).ugettext
except IOError:
    # no catalog found: silently fall back to the untranslated strings
    def null_gettext(string):
        return string
    _ = null_gettext

Please see the further discussion in #2425.

comment:8 in reply to:  6 Changed 5 years ago by glynn

Replying to zarch:

We have to do this work in any case for python3.

If we actually use it. For most scripting tasks, Python 3 offers nothing but inconvenience.

And even then, there's a much simpler way to deal with it: convert unicode strings to byte strings at the point they arise (there are far fewer of these compared to the number of places where we will need to write byte strings to streams or pass them as command arguments).
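Converting at the point where unicode values arise could look roughly like the sketch below, which assumes Python 3 for runnability. bytes_translator and the one-entry catalog are hypothetical, and using 'replace' for characters the locale encoding cannot represent is an assumed policy, not something stated in this ticket.

```python
def bytes_translator(translate, encoding):
    """Wrap a unicode-returning translate function so it returns bytes."""
    def translate_to_bytes(message):
        result = translate(message)
        if isinstance(result, str):
            # Convert once, here, instead of at every write or command call.
            result = result.encode(encoding, 'replace')
        return result
    return translate_to_bytes


catalog = {'color': 'krāsa'}  # hypothetical translations
_ = bytes_translator(lambda msg: catalog.get(msg, msg), 'cp1257')

translated = _('color')      # bytes in the locale's encoding
untranslated = _('missing')  # falls back to the message id, still bytes
```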

We can create a function that explicity convert every input to unicode, something like:

But why bother? At the lowest level, scripts tend to do two things: invoke commands and read/write streams. Both of these deal with byte strings. Converting to unicode then back again just creates unnecessary failure modes; there's no guarantee that data read from a given stream will be in the locale's encoding, or even in any known encoding.
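The point about unknown encodings can be illustrated with a few bytes that are not valid UTF-8. The sketch assumes Python 3; the sample data is made up, and the 'surrogateescape' error handler used at the end is a Python 3 facility shown only as one way to keep such bytes recoverable if text is unavoidable.

```python
data = b'label: \xe9tiquette\n'  # Latin-1 bytes arriving from some stream

# Passing the bytes through untouched cannot fail.
passed_through = data

# Decoding with the wrong assumption about the encoding does fail.
try:
    data.decode('utf-8')
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# If text is needed, 'surrogateescape' preserves the undecodable byte
# so that the original data can be restored exactly.
text = data.decode('utf-8', 'surrogateescape')
restored = text.encode('utf-8', 'surrogateescape')
```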

wxGUI has to deal with this because wxPython uses Unicode throughout (and look how many wxGUI issues relate to Unicode{Encode,Decode}Error as a result). The scripting library doesn't need to deal with this; there's no inherent reason why most scripts should ever encounter a unicode value.

comment:9 in reply to:  5 Changed 4 years ago by neteler

Replying to glynn:

Replying to marisn:

In Python 2.4 the lgettext() family of functions were introduced. The intention of these functions is to provide an alternative which is more compliant with the current implementation of GNU gettext. Unlike gettext(), which returns strings encoded with the same codeset used in the translation file, lgettext() will return strings encoded with the preferred system encoding, as returned by locale.getpreferredencoding().

Right. Unfortunately, gettext.install() binds the _() function to the .gettext() method rather than to the .lgettext() method.

Try r64834.

Is this a backport candidate?

comment:10 Changed 4 years ago by neteler

Milestone: 7.0.1 → 7.0.2

Ticket retargeted after 7.0.1 milestone closed

comment:11 Changed 4 years ago by martinl

Is this ticket still valid?

comment:12 Changed 4 years ago by neteler

Milestone: 7.0.2 → 7.0.3

Ticket retargeted after milestone closed

comment:13 Changed 4 years ago by mlennert

I'm still getting this error both in trunk and release70. My locale:

> locale
LANG=fr_BE
LANGUAGE=fr_BE
LC_CTYPE=fr_BE.UTF-8
LC_NUMERIC=C
LC_TIME=fr_BE.UTF-8
LC_COLLATE=fr_BE.UTF-8
LC_MONETARY=fr_BE.UTF-8
LC_MESSAGES=fr_BE.UTF-8
LC_PAPER=fr_BE.UTF-8
LC_NAME=fr_BE.UTF-8
LC_ADDRESS=fr_BE.UTF-8
LC_TELEPHONE=fr_BE.UTF-8
LC_MEASUREMENT=fr_BE.UTF-8
LC_IDENTIFICATION=fr_BE.UTF-8
LC_ALL=

I only get an error with a raster map with labels (e.g. landclass96), presumably because 'label' is translated to 'étiquette' in French.

Here's the entire backtrace:

Traceback (most recent call last):
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapwin/buffered.py", line 1234, in MouseActions
    self.OnLeftUp(event)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapwin/buffered.py", line 1411, in OnLeftUp
    self.mapQueried.emit(x=self.mouse['end'][0], y=self.mouse['end'][1])
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/signal.py", line 229, in emit
    dispatcher.send(signal=self, *args, **kwargs)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/dispatcher.py", line 349, in send
    **named
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/etc/python/grass/pydispatch/robustapply.py", line 60, in robustApply
    return receiver(*arguments, **named)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapdisp/frame.py", line 868, in Query
    self.QueryMap(east, north, qdist, rast, vect)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/mapdisp/frame.py", line 920, in QueryMap
    self.dialogs['query'].SetData(result)
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 87, in SetData
    self.redirectOutput.emit(output=self._textToRedirect())
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 148, in _textToRedirect
    text = printResults(self._model, self._colNames[1])
  File "/home/mlennert/SRC/GRASS/grass_trunk/dist.x86_64-unknown-linux-gnu/gui/wxpython/gui_core/query.py", line 215, in printResults
    return '\n'.join(textList)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

comment:14 in reply to:  13 ; Changed 4 years ago by annakrat

Replying to mlennert:

I'm still getting this error both in trunk and release70. My locale:

Try r67328.

comment:15 in reply to:  14 Changed 4 years ago by mlennert

Replying to annakrat:

Replying to mlennert:

I'm still getting this error both in trunk and release70. My locale:

Try r67328.

Works for me. I also tried with accents in a label and that works as well.

Thanks !

comment:16 Changed 4 years ago by annakrat

Resolution: fixed
Status: new → closed

Backported in r67335.
