Opened 13 years ago

Closed 6 years ago

#1488 closed defect (fixed)

v.in.dxf wrapper not working

Reported by: cmbarton Owned by: grass-dev@…
Priority: normal Milestone: 7.0.7
Component: wxGUI Version: svn-trunk
Keywords: v.in.dxf Cc:
CPU: Unspecified Platform: Unspecified

Description

I just tried to import a dxf file, using v.in.dxf. When I loaded it into the special GUI wrapper, it did not show up any layers and the import button was greyed out.

Fortunately, I clicked the "command dialog" button. The layers showed up without any problem and the file imported fine.

I'm not sure what this wrapper adds to v.in.dxf. But it is not working correctly.

Michael

Change History (24)

comment:1 by martinl, 13 years ago

Keywords: v.in.dxf added

Do you mean

File -> Import vector data -> DXF import

?

It's working for me (testing with GRASS 7).

comment:2 by cmbarton, 13 years ago

OK. This seems to be some kind of OS X 10.7 (AKA Lion) issue. It does work fine on my OS X 10.6 (Snow Leopard) machine it was compiled on, but gives the following error on my Lion laptop.

Traceback (most recent call last):
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/python/wx/lib/filebrowsebutton.py", line 140, in OnChanged

self.changeCallback(evt)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/gdialogs.py", line 1946, in
OnSetDsn

input = path)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/gcmd.py", line 611, in RunCommand

stdout, stderr = map(utils.DecodeString, ps.communicate())
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/utils.py", line 665, in
DecodeString

return string.decode(enc)
  File "/System/Library/Frameworks/Python.framework/Versions
/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode

return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError
:
'utf8' codec can't decode byte 0xc1 in position 77: invalid
start byte

What is the function of "Decodestring" in this context? Is it necessary?

I'll try building under Lion to see if this goes away.

Michael

comment:3 by cmbarton, 13 years ago

AFter looking into this, it appears that it is NOT an OSX 10.7 issue but a bug. However, for some reason, the bug only shows up on my OSX 10.7 box.

The DecodeString method in util.py calls string.decode()

However, the string module does not have a "decode" method in Python 2.6 or 2.7. This is what is causing the error. What is this supposed to be doing? The value that string.decode() is supposed to be decoding is "mac-roman" on my machine.

Michael

in reply to:  3 comment:4 by glynn, 13 years ago

Replying to cmbarton:

AFter looking into this, it appears that it is NOT an OSX 10.7 issue but a bug. However, for some reason, the bug only shows up on my OSX 10.7 box.

The DecodeString method in util.py calls string.decode()

However, the string module does not have a "decode" method in Python 2.6 or 2.7.

In this context, "string" is the name of DecodeString()'s parameter, not the name of a module. It is presumably meant to be a string; since 2.2, Python strings have had a .decode() method (link).

This is what is causing the error. What is this supposed to be doing?

Decoding the string (converting from a sequence of bytes to a sequence of Unicode characters).

The value that string.decode() is supposed to be decoding is "mac-roman" on my machine.

The argument is the name of the encoding; the string on which the .decode() method is called is the string which is decoded. So string.decode('mac-roman') converts the variable "string" from a sequence of bytes in the "mac-roman" encoding to a Unicode string.

Mac-roman is a unibyte encoding (one byte equals one character), so it shouldn't be possible for the decoding to fail (you just get mojibake if the string isn't actually in the mac-roman encoding). But the traceback in comment:2 indicates that the UTF-8 encoding is being used. UTF-8 is a multibyte encoding which reserves specific bytes for specific positions (e.g. 0-127 are single-byte characters, 192-255 are used for the first byte of a multi-byte character, 128-191 are used for subsequent bytes of a multi-byte character). If a byte appears in the wrong position, you get an error.

in reply to:  2 ; comment:5 by glynn, 13 years ago

Replying to cmbarton:

'utf8' codec can't decode byte 0xc1 in position 77: invalid start byte

0xc1 is in the range used for start bytes. However, a start byte of 0xc1 implies that the top two bits of the resulting character are "01", which would mean that it was in the range 0x40-0x7F (64-127), so it should be encoded as a single-byte character. "Strict" UTF-8 decoders disallow the use of over-long encodings (i.e. encoding a character using more bytes than is strictly necessary), requiring encoded representations to be unique (i.e. character 0x40 must be encoded as '\x40', and not as e.g. '\xc1\x00').

However, it's far more likely that command is outputting non-UTF-8 data, which the GUI is then attempting to parse as UTF-8. In general, GRASS doesn't care about encodings; it just deals with bytes. Unfortunately, this isn't something which can readily be fixed. GRASS often has to deal with textual data which has no associated encoding, so there's no way it can convert it to the locale's encoding (the wx GUI assumes that all text is in the locale's encoding).

in reply to:  5 comment:6 by glynn, 13 years ago

Replying to glynn:

In general, GRASS doesn't care about encodings; it just deals with bytes. Unfortunately, this isn't something which can readily be fixed.

I should have mentioned that while it can't actually be "fixed", it can fail less hard. The str.decode() method accepts an optional second argument, "errors", which can be "strict" (raise an exception), "replace" (replace unconvertible sequences with U+FFFD) or ignore (silently discard unconvertible sequences).

The default is "strict". utils.DecodeString should probably use one of the others, particularly if the output is just text which is going to be shown to the user. "strict" should only be used in situations where an error is preferable to an "approximate" value.

comment:7 by cmbarton, 13 years ago

After a bit of testing, using 'ignore' works, but 'replace' creates further errors down the line

Traceback (most recent call last):
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/python/wx/lib/filebrowsebutton.py", line 140, in OnChanged

self.changeCallback(evt)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/gdialogs.py", line 1962, in
OnSetDsn

grassName = utils.GetValidLayerName(layerName)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/utils.py", line 171, in
GetValidLayerName

retName = str(name).strip()
UnicodeEncodeError
:
'ascii' codec can't encode character u'\ufffd' in position
0: ordinal not in range(128)

Unless there is a compelling reason to do otherwise, I suggest using 'ignore'. Thanks for the insight Glynn.

Also, since the "string" module is imported, using "string" as a variable name in a method so as to produce a code line like

string.decode(enc)

seems potentially confusing to those reading the code at least, and potentially to the program. Maybe the variable should be changed to "str" or something along that line.

comment:8 by cmbarton, 13 years ago

It looks to me that the previous issues/solutions also apply to the StringEncode(string) method in until.py

Michael

comment:9 by cmbarton, 13 years ago

Sorry, I meant the

EncodeString(string):

method

comment:10 by cmbarton, 13 years ago

Any reason that I should not go ahead and make these suggested changes in until.py?

Michael

in reply to:  7 ; comment:11 by glynn, 13 years ago

Replying to cmbarton:

After a bit of testing, using 'ignore' works, but 'replace' creates further errors down the line

'ascii' codec can't encode character u'\ufffd' in position
0: ordinal not in range(128)

Unless there is a compelling reason to do otherwise, I suggest using 'ignore'.

One option is:

string.decode(enc, 'replace').replace(u'\ufffd',u'?')

This will replace unconvertible sequences with question marks, which provides a clue that the string is an approximation (this is how e.g. "ls" behaves if it needs to list filenames which contain sequences which cannot be converted according to the current locale).

However, if the EncodeString method is changed to also use 'replace', the U+FFFD characters can be left alone:

> 'hello\xc1world'.decode('utf-8', 'replace')
u'hello\ufffdworld'
> 'hello\xc1world'.decode('utf-8', 'replace').encode('ascii', 'replace')
'hello?world'

Also, since the "string" module is imported, using "string" as a variable name in a method so as to produce a code line like

string.decode(enc)

seems potentially confusing to those reading the code at least, and potentially to the program. Maybe the variable should be changed to "str" or something along that line.

"str" is the actual name of Python's native string type, and of its constructor, str().

It doesn't really matter if a parameter or local name shadows a global or built-in name, so long as you don't need to actually use the name being shadowed within the function.

For complex functions, it's easy to forget that a built-in name is being shadowed, resulting in obscure errors (e.g. "'str' object is not callable" if you try to use the map() function when "map" is actually a local variable containing a string). But EncodeString is only 7 lines.

Also, utils.py only imports the "string" module for the join() and split functions, which are used in normalize_whitespace(). Presumably it could just use the .join() and .split() methods, i.e.:

return ' '.join(text.split())

in reply to:  11 ; comment:12 by cmbarton, 13 years ago

Replying to glynn:

Replying to cmbarton:

After a bit of testing, using 'ignore' works, but 'replace' creates further errors down the line

'ascii' codec can't encode character u'\ufffd' in position
0: ordinal not in range(128)

Unless there is a compelling reason to do otherwise, I suggest using 'ignore'.

One option is:

string.decode(enc, 'replace').replace(u'\ufffd',u'?')

This will replace unconvertible sequences with question marks, which provides a clue that the string is an approximation (this is how e.g. "ls" behaves if it needs to list filenames which contain sequences which cannot be converted according to the current locale).

This works fine.

However, if the EncodeString method is changed to also use 'replace', the U+FFFD characters can be left alone:

Changing EncodeString to use replace does not work. There is still the same error reported above.

> 'hello\xc1world'.decode('utf-8', 'replace')
u'hello\ufffdworld'
> 'hello\xc1world'.decode('utf-8', 'replace').encode('ascii', 'replace')
'hello?world'

Also, since the "string" module is imported, using "string" as a variable name in a method so as to produce a code line like

string.decode(enc)

seems potentially confusing to those reading the code at least, and potentially to the program. Maybe the variable should be changed to "str" or something along that line.

"str" is the actual name of Python's native string type, and of its constructor, str().

It doesn't really matter if a parameter or local name shadows a global or built-in name, so long as you don't need to actually use the name being shadowed within the function.

For complex functions, it's easy to forget that a built-in name is being shadowed, resulting in obscure errors (e.g. "'str' object is not callable" if you try to use the map() function when "map" is actually a local variable containing a string). But EncodeString is only 7 lines.

Also, utils.py only imports the "string" module for the join() and split functions, which are used in normalize_whitespace(). Presumably it could just use the .join() and .split() methods, i.e.:

return ' '.join(text.split())

This seems like a good idea to simplify the code some.

Michael

in reply to:  12 comment:13 by glynn, 13 years ago

Replying to cmbarton:

Changing EncodeString to use replace does not work. There is still the same error reported above.

That's because utils.GetValidLayerName() just uses str() rather than EncodeString() or similar. str() uses the default encoding (not the locale's encoding), which is usually ASCII. And (AFAICT) it uses the equivalent of errors='strict', so that needs to be fixed regardless of how DecodeString() behaves. Otherwise, you'll get a similar error for any non-ASCII character.

comment:14 by cmbarton, 13 years ago

Well, changing GetValidLayerName sort of works in that it does display all the dxf layers now. But it throws another layer in yet another place

Traceback (most recent call last):
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/python/wx/lib/filebrowsebutton.py", line 140, in OnChanged

self.changeCallback(evt)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/gdialogs.py", line 1965, in
OnSetDsn

self.list.LoadData(data)
  File "/Applications/GRASS/GRASS-7.0.app/Contents/MacOS/etc
/gui/wxpython/gui_modules/gdialogs.py", line 2014, in
LoadData

self.SetStringItem(index, i, "%s" % str(item[i]))
UnicodeEncodeError
:
'ascii' codec can't encode character u'\ufffd' in position
0: ordinal not in range(128)

I fear that this could go on and on, and then pop up somewhere else. Is there a cost that actually matters to either using 'ignore' in DecodeString and EncodeString or using .replace(u'\ufffd',u'?') in both cases?

Michael

in reply to:  14 comment:15 by glynn, 13 years ago

Replying to cmbarton:

I fear that this could go on and on, and then pop up somewhere else. Is there a cost that actually matters to either using 'ignore' in DecodeString and EncodeString or using .replace(u'\ufffd',u'?') in both cases?

Using 'ignore' will silently ignore conversion errors. The user may then try to use the names shown in the GUI and have no clue as to why that doesn't work. Using a replacement will alert the user to the fact that the name shown is an approximation.

Using U+FFFD as the replacement (instead of e.g. a question mark) will make it easier to locate all of the places which are using str() instead of EncodeString(). "Hiding" these errors will make it harder to fix them.

str(x) (where x is a Unicode string) is roughly equivalent to x.encode(sys.getdefaultencoding(), 'strict'), which in turn is usually equivalent to x.encode('ascii', 'strict').

In most of the places where str() is used, it's debatable whether either the 'ascii' encoding or the 'strict' error behaviour are the correct choice. In all probability, str() was used in order to duck the issue. But if the GUI is to actually work in non-ASCII locales (meaning: the user is working with files with non-ASCII names and/or non-ASCII contents; as opposed to just having localised menus and labels), then the issues need to be addressed.

comment:16 by cmbarton, 13 years ago

I'm not quite sure what you are suggesting as a solution.

Do we need to search through all modules and replace str() with str.decode() or utility.DecodeString()? Or is this just needed in a few spots?

Or do we just need to change DecodeString() (and EncodeString() too???) to use string.encode(enc, 'replace').replace(u'\ufffd',u+'\fffd')?

Michael

in reply to:  16 comment:17 by glynn, 13 years ago

Replying to cmbarton:

I'm not quite sure what you are suggesting as a solution.

Do we need to search through all modules and replace str() with str.decode() or utility.DecodeString()?

Yes.

Or is this just needed in a few spots?

I don't know which cases (if any) should use str(). Someone will need to examine them and make a determination based upon how the result will be used.

In most cases, str() is the wrong solution. Correct solutions are:

  • If you know that a specific encoding should be used, use it.
  • If you get to choose the encoding, UTF-8 is preferable, as it can encode all Unicode characters.
  • If you don't have any other information, the locale's encoding is the best guess.

Historically, str() was the last resort when you require a byte string. If you're passed a Unicode string, the user has screwed up, but using str() may be able to recover from it providing the string only contains ASCII characters. In Python 3.x, they realised that this was just encouraging people to write broken code, and removed implicit byte-string to/from unicode-string conversions. Standard library functions and methods which expect one or the other will just raise an exception if you pass the wrong sort.

Or do we just need to change DecodeString() (and EncodeString() too???) to use string.encode(enc, 'replace').replace(u'\ufffd',u+'\fffd')?

Both of those functions should just use 'replace' for the error parameter. This will replace invalid sequences with U+FFFD when decoding, and all unrepresentable characters (including, but not limited to, U+FFFD) with '?' when encoding.

comment:18 by cmbarton, 13 years ago

There are a lot of str() to look at.

grep ' str(' *.py | wc -l

returns 203 in gui_modules alone

It looks like most of them are aimed at making sure that a text string instead of a number is returned.

Michael

in reply to:  18 comment:19 by glynn, 13 years ago

Replying to cmbarton:

There are a lot of str() to look at.

grep ' str(' *.py | wc -l

returns 203 in gui_modules alone

The leading space is artificially depressing that figure.

The expressions '\<' and '\>' can be used to match the beginning and end of a "word". In 7.0,

grep '\<str(' *.py | wc -l

gives 294, but that isn't quite all of them, as there are some uses of "map(str, ...)" and similar. Using

grep '\<str\>' *.py | wc -l

finds 332, although that also includes some false positives where "str" is used as a variable name.

It looks like most of them are aimed at making sure that a text string instead of a number is returned.

Using str() to create string representations of numeric values is fine. Converting arbitrary Python values to strings (e.g. for debugging) should use repr(). Conversion of unicode strings to byte strings should use the .encode() method.

comment:20 by martinl, 9 years ago

Milestone: 7.0.07.0.5

comment:21 by martinl, 8 years ago

Still the issue?

comment:22 by neteler, 8 years ago

Milestone: 7.0.57.0.6

comment:23 by neteler, 7 years ago

Milestone: 7.0.67.0.7

comment:24 by martinl, 6 years ago

Resolution: fixed
Status: newclosed

No activity for a long time. Closing. Feel free to reopen if needed.

Note: See TracTickets for help on using tickets.