#808 closed defect (fixed)
shp2pgsql and encoding, something must be wrong
Reported by: | nicklas | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.0.0 |
Component: | postgis | Version: | master |
Keywords: | Cc: |
Description
I find encoding very frustrating and hard to understand. This ticket may well be invalid, but I have turned the problem over so many times that I now think something is wrong.
I have attached a simple dbf-file with one field called "address" and one row with the text: "Tårneby in Våler i Solør kommune"
If I first try to use shp2pgsql and just ignore the funny letters, like this:
nicklas@ubuntu64:~/Documents$ /usr/lib/postgresql/8.4/bin/shp2pgsql test.dbf>test.sql
I get the error message :
Unable to convert data value to UTF-8 (iconv reports "Invalid or incomplete multibyte or wide character"). Current encoding is "UTF-8". Try "LATIN1" (Western European), or one of the values described at http://www.postgresql.org/docs/current/static/multibyte.html.
If I do like this:
nicklas@ubuntu64:~/Documents$ /usr/lib/postgresql/9.0/bin/shp2pgsql -W LATIN1 test.dbf>test.sql
the sql file is produced like this:
SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "test" (gid serial PRIMARY KEY, "address" varchar(32));
INSERT INTO "test" ("address") VALUES ('Tårneby in Våler I Solør kommune');
COMMIT;
The problem is that psql won't load this sql-file into the database complaining like this:
psql:test.sql:6: ERROR: invalid byte sequence for encoding "UTF8": 0xe5726e
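The byte sequence in that error can be checked by hand. The following is a minimal Python sketch for illustration only (Python is not part of the loader, and the helper name `decode_as` is hypothetical). It shows that the bytes 0xe5 0x72 0x6e read as "årn" in LATIN1, while in UTF-8 the byte 0xe5 opens a three-byte sequence and 0x72 is not a valid continuation byte:

```python
def decode_as(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The three bytes psql complained about: 0xe5726e
offending = bytes.fromhex("e5726e")

# Valid LATIN1: every single byte maps to one character.
as_latin1 = decode_as(offending, "latin-1")

# Invalid UTF-8: 0xe5 must be followed by two continuation bytes (0x80-0xbf).
as_utf8 = decode_as(offending, "utf-8")
```

This is consistent with the .sql file still containing LATIN1 bytes even though its first line declares UTF8.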
So what I have to do is change the first row in the sql file to client_encoding LATIN1 instead. Then everything works.
According to the PostGIS documentation, shp2pgsql is supposed to convert to UTF8 in the sql file so psql can load UTF8. I don't think it works that way. shp2pgsql does nothing about the actual encoding, but it should tell PostgreSQL about the original encoding.
The current behavior makes it impossible to use shp2pgsql-gui, since there is no way to edit the sql file. You will get one error or another no matter what encoding you declare.
I have only tried this on trunk version.
What I don't understand is whether some local setting on my system makes things different. But I think DEPESZ's explanation here makes sense, and my experience agrees with it.
Thanks Nicklas
Attachments (3)
Change History (21)
comment:1 by , 14 years ago
Nicklas,
I'll give this one a try. My understanding was that shp2pgsql looks at the encoding and translates the encoding from what was specified LATIN1 to UTF8 so that by the time it gets to the .sql export it is already in UTF-8. That is why you don't see anything different in the sql file (except that the text should have been rewritten as UTF8). I could be wrong though.
At any rate, that is my understanding of why we compile shp2pgsql with iconv support: so it can do that translation. I never did quite understand (or rather, I forgot) why we don't just have PostgreSQL do the translation, except that it's less portable (e.g. I think WIN1252 is valid on Windows and not on Unix, or at least not on PostgreSQL on Linux).
Have you tried the 1.5 version to see if this is just an issue in trunk? So much stuff has changed in trunk.
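The LATIN1-to-UTF-8 translation described in this comment can be sketched in a few lines of Python. This is only an illustration of the intended result, not the loader's actual code; the function name `latin1_to_utf8` is hypothetical, and the assumption is that the DBF stores the sample text as LATIN1:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode LATIN1 bytes as UTF-8, as a LATIN1 -> UTF-8 iconv pass would."""
    return raw.decode("latin-1").encode("utf-8")


text = "Tårneby in Våler i Solør kommune"

# Assumed on-disk form in the DBF: one byte per character.
dbf_bytes = text.encode("latin-1")

# What the .sql file should contain when it declares CLIENT_ENCODING UTF8.
sql_bytes = latin1_to_utf8(dbf_bytes)
```

Under these assumptions the converted text grows from 32 to 35 bytes, because each of the three non-ASCII letters takes two bytes in UTF-8. A .sql file whose text is byte-for-byte identical to the DBF content has therefore not been converted.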
comment:2 by , 14 years ago
Just to add (at least on Windows): I have a feeling that somehow in trunk the order of arguments became relevant. Look at my ticket #659 (though I have to verify it's still an issue) and your ticket #779. I wonder if it's a symptom of the same thing: some arguments are either being ignored or their order matters.
(Oh, and sorry, I didn't read your comment closely enough. I see you haven't tested on 1.5 and your impressions are the same as mine.)
comment:3 by , 14 years ago
Okay, I think this is a new issue. I can load the file I generated with my 1.5 shp2pgsql but can't load the one created with my trunk shp2pgsql. So I suspect that in trunk the -W switch is being ignored.
For comparison, I have attached both generated .sql files.
comment:4 by , 14 years ago
Thanks Regina
Sorry for not trying 1.5 myself.
OK, there seem to be several issues with the loader in trunk.
/Nicklas
comment:5 by , 14 years ago
Yah and I'm hoping they are all symptoms of the same issue so we can close all those tickets at once
comment:6 by , 14 years ago
I think Paul recently made some changes in trunk with the aim of trying to make things a bit more friendly - can you do an SVN bisect to find the bad revision? If so, I'll go take a look.
comment:7 by , 14 years ago
Yes, I will try to dig a little deeper, but it will not be right now.
/Nicklas
comment:9 by , 14 years ago
I checked my builds and all of them are bad so I'll have to go back even further.
So it looks like it was on or before http://trac.osgeo.org/postgis/timeline?from=2010-09-03, and the only change I can see before that time that could cause this is r5450 on March 22 2010. That doesn't make sense either, since that should have shown up as a failure in the 1.5.2 release (but that is okay), unless the issue is not in the loader folder.
comment:10 by , 14 years ago
I'd like to point out that our regression test suite does support testing the loader and dumper. It would be nice to see cases like this one added to the test suite, to help stabilize things.
comment:11 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in trunk, revision 6932.
I added a regression test using the attached DBF file (thanks nicklas!).
comment:12 by , 14 years ago
On MinGW the regress is failing. Here is the output of the diff:
--- loader/Latin1.select.expected Fri Mar 18 16:06:43 2011
+++ /tmp/pgis_reg_5684/test_22_out Fri Mar 18 16:13:07 2011
@@ -1 +1 @@
-1|TÃ¥rneby in VÃ¥ler I Solør kommune
+1|Tårneby in Våler I Solør kommune
comment:13 by , 14 years ago
You know what's bizarre? The test_22_out looks like the right output (Tårneby in Våler I Solør kommune), so I'm not sure where the other one is coming from. It might actually be an issue with the encoding of the expected file or with the comparison. So I think this might be a false negative.
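One plausible source of text like "TÃ¥rneby" is UTF-8 bytes being displayed as if they were LATIN1 (or Win1252). The Python sketch below reproduces that effect; it is an illustration under that assumption, not a confirmed diagnosis of the expected file, and the function name `misread_utf8_as_latin1` is hypothetical:

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as if they were LATIN1."""
    return text.encode("utf-8").decode("latin-1")


# "å" is the two bytes 0xc3 0xa5 in UTF-8; read as LATIN1 they become "Ã" and "¥".
garbled = misread_utf8_as_latin1("Tårneby")
```

If that is what is happening here, the two files hold the same text in different encodings, and the diff fails on bytes rather than on content.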
comment:14 by , 14 years ago
I believe you're correct. I think the file you're getting out of SVN is somehow different from the one I checked in, probably because SVN is doing some encoding translation wrong. You can set encodings in SVN, I'll need to figure out what it should be set to. Either that or try marking it as a binary file so nothing gets translated.
comment:15 by , 14 years ago
It should be latin1 which I think is ISO-8859-1. I'll try to resave the file as that on my end and see if that test passes.
comment:16 by , 14 years ago
I wonder if it just inherits the default of the OS (similar to what happens with the linefeed which is why I have to explicitly tag that). When I check in my editor it registers as Win1252.
comment:17 by , 14 years ago
Well, it works if I correct the expected file and set its encoding to latin1. So it's the expected file itself that is the problem.
comment:18 by , 14 years ago
Interesting. What did you do to the expected file, just set the encoding to latin1? Via svn:propset or what?
I'm not sure if MinGW has the "od" command, but if it does could you run it on both files?
od <expected>
od <_out>
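If "od" is not available, the same byte-level comparison can be sketched in Python. The helper names `hex_dump` and `first_difference` are hypothetical, and the two sample byte strings stand in for the contents of the real files, which would be read with open(path, "rb"):

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # Same prefix: they differ only if one is longer than the other.
    return None if len(a) == len(b) else min(len(a), len(b))


# Stand-ins for the two files: the same text in the two candidate encodings.
latin1_line = "1|Tårneby".encode("latin-1")
utf8_line = "1|Tårneby".encode("utf-8")

latin1_hex = hex_dump(latin1_line)
utf8_hex = hex_dump(utf8_line)
diff_offset = first_difference(latin1_line, utf8_line)
```

A single 0xe5 byte where the other file has 0xc3 0xa5 would indicate that one file is LATIN1 and the other UTF-8.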
dbf with Norwegian letters