Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#808 closed defect (fixed)

shp2pgsql and encoding, something must be wrong

Reported by: nicklas Owned by: pramsey
Priority: medium Milestone: PostGIS 2.0.0
Component: postgis Version: master
Keywords: Cc:

Description

I find encoding very frustrating and hard to understand. This report may well be invalid, but I have turned the problem over so many times that I now think something is wrong.

I have attached a simple dbf-file with one field called "address" and one row with the text: "Tårneby in Våler i Solør kommune"

If I first try to use shp2pgsql while simply ignoring the non-ASCII letters, like this:

nicklas@ubuntu64:~/Documents$ /usr/lib/postgresql/8.4/bin/shp2pgsql test.dbf>test.sql

I get the error message :

Unable to convert data value to UTF-8 (iconv reports "Invalid or incomplete multibyte or wide character"). Current encoding is "UTF-8". Try "LATIN1" (Western European), or one of the values described at http://www.postgresql.org/docs/current/static/multibyte.html.

If I do like this:

nicklas@ubuntu64:~/Documents$ /usr/lib/postgresql/9.0/bin/shp2pgsql -W LATIN1 test.dbf>test.sql

the sql file is produced like this:

SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "test" (gid serial PRIMARY KEY,
"address" varchar(32));
INSERT INTO "test" ("address") VALUES ('Tårneby in Våler I Solør kommune');
COMMIT;

The problem is that psql won't load this sql-file into the database complaining like this:

psql:test.sql:6: ERROR:  invalid byte sequence for encoding "UTF8": 0xe5726e

So what I have to do is change the first row of the sql file to set client_encoding to LATIN1 instead. Then everything works.
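The byte sequence in the psql error, 0xe5726e, is consistent with the sql file still holding LATIN1 bytes: 0xe5 is "å" in LATIN1, and 0x72 0x6e are ASCII "r" and "n", i.e. the "årn" of "Tårneby". A minimal Python sketch (not part of shp2pgsql or psql; the variable names are only illustrative) shows why a UTF-8 decoder rejects exactly those bytes:

```python
# Bytes reported in the psql error message: 0xe5 0x72 0x6e
raw = bytes.fromhex("e5726e")

# Read as LATIN1, the bytes are ordinary text (a-ring, "r", "n").
as_latin1 = raw.decode("latin-1")

# Read as UTF-8 they are invalid: 0xe5 announces a three-byte
# sequence, but 0x72 is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(ascii(as_latin1), utf8_ok)
```

This would also explain why switching client_encoding to LATIN1 makes the file load: the declared encoding then matches the bytes actually in the file.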

According to the PostGIS doc, shp2pgsql is supposed to convert to UTF8 in the sql file so that psql can load UTF8. I don't think it works that way. As far as I can tell, shp2pgsql does nothing about the actual encoding; in that case it should at least tell PostgreSQL what the original encoding is.

The current behavior makes it impossible to use shp2pgsql-gui, since there is no way to edit the sql file. You will get one error or another no matter what encoding you declare.

I have only tried this on the trunk version.

What I don't understand is whether some local setting on my system makes things different. But I think DEPESZ's explanation here makes sense, and my experience agrees with it.

Thanks Nicklas

Attachments (3)

test.dbf (99 bytes) - added by nicklas 9 years ago.
dbf with Norwegian letters
test152.sql (238 bytes) - added by robe 9 years ago.
generated with shp2pgsql from postgis 1.5.2
testtrunk.sql (235 bytes) - added by robe 9 years ago.
shp2pgsql from trunk


Change History (21)

Changed 9 years ago by nicklas

Attachment: test.dbf added

dbf with Norwegian letters

comment:1 Changed 9 years ago by robe

Nicklas,

I'll give this one a try. My understanding was that shp2pgsql looks at the encoding and translates the encoding from what was specified LATIN1 to UTF8 so that by the time it gets to the .sql export it is already in UTF-8. That is why you don't see anything different in the sql file (except that the text should have been rewritten as UTF8). I could be wrong though.
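The translation described here can be sketched in a few lines of Python. This is only an illustration of the LATIN1-to-UTF-8 step, not the loader's actual iconv code; `latin1_to_utf8` is a hypothetical helper, and the sample bytes assume the dbf field is stored as LATIN1.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode LATIN1 bytes as UTF-8 (hypothetical helper)."""
    return raw.decode("latin-1").encode("utf-8")

# "Tarneby" with a-ring, as it would sit in a LATIN1 .dbf field
dbf_value = b"T\xe5rneby"

# After translation the a-ring becomes the two bytes 0xc3 0xa5,
# which is what a file declaring CLIENT_ENCODING UTF8 should contain.
sql_value = latin1_to_utf8(dbf_value)

print(dbf_value.hex(), sql_value.hex())
```

If the translation is skipped, the single byte 0xe5 ends up in the sql file unchanged, which matches the byte sequence in the psql error above.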

At any rate, that is my understanding of why we compile shp2pgsql with iconv support: so it can do that translation. I never did quite understand why we don't just have PostgreSQL do the translation (or rather, I forgot why), except that it's less portable (e.g. I think WIN1252 is valid on Windows and not on Unix, or at least not on PostgreSQL for Linux).

Have you tried the 1.5 version to see if this is just an issue in trunk? So much stuff has changed in trunk.

comment:2 Changed 9 years ago by robe

Just to add (at least on Windows), I have a feeling that somehow in trunk the order of arguments became relevant. Look at my ticket #659 (though I have to verify it's still an issue) and your ticket #779. I wonder if they are symptoms of the same thing: that some arguments are either being ignored or their order matters.

(Oh, and sorry I didn't read your comment more closely -- I see you haven't tested on 1.5 and your impressions are the same as mine.)

comment:3 Changed 9 years ago by robe

Okay, I think this is a new issue. I can load the file I generated with my 1.5 shp2pgsql but can't load the one created with my trunk shp2pgsql. So I suspect that in trunk the -W switch is being ignored.

For comparison, I have attached both generated .sql files.

Changed 9 years ago by robe

Attachment: test152.sql added

generated with shp2pgsql from postgis 1.5.2

Changed 9 years ago by robe

Attachment: testtrunk.sql added

shp2pgsql from trunk

comment:4 Changed 9 years ago by nicklas

Thanks Regina

Sorry for not trying 1.5 myself.

OK, there seem to be several issues with the loader in trunk.

/Nicklas

comment:5 Changed 9 years ago by robe

Yah and I'm hoping they are all symptoms of the same issue so we can close all those tickets at once :)

comment:6 Changed 9 years ago by mcayland

I think Paul recently made some changes in trunk with the aim of trying to make things a bit more friendly - can you do an SVN bisect to find the bad revision? If so, I'll go take a look.

comment:7 Changed 9 years ago by nicklas

Yes, I will try to dig a little deeper, but it will not be right now.

/Nicklas

comment:8 Changed 9 years ago by robe

I'll try to give it a look too in the next couple of days.

comment:9 Changed 9 years ago by robe

I checked my builds and all of them are bad so I'll have to go back even further.

So it looks like it was on or before

http://trac.osgeo.org/postgis/timeline?from=2010-09-03 and the only change I can see before that time that could cause this is r5450 on March 22, 2010. That doesn't make sense either, since it should have shown up as a failure in the 1.5.2 release (but that is okay). Unless the issue is not in the loader folder.

comment:10 Changed 9 years ago by strk

I'd like to point out that our regression test suite does support testing the loader and dumper. It would be nice to see cases like this one added to the test suite, so that the behavior stays stable.

comment:11 Changed 9 years ago by jadams

Resolution: fixed
Status: new → closed

Fixed in trunk, revision 6932.

I added a regression test using the attached DBF file (thanks nicklas!).

comment:12 Changed 9 years ago by robe

On MingW the regress is failing -- here is the output of the diff

--- loader/Latin1.select.expected	Fri Mar 18 16:06:43 2011
+++ /tmp/pgis_reg_5684/test_22_out	Fri Mar 18 16:13:07 2011
@@ -1 +1 @@
-1|Tårneby in Våler I Solør kommune
+1|Tårneby in Våler I Solør kommune

comment:13 Changed 9 years ago by robe

You know what's bizarre? The test_22_out looks like the right output (Tårneby in Våler I Solør kommune), so I'm not sure where that other one is coming from. It might actually be an issue with the encoding of the expected file or with the comparison. So I think this might be a false negative.

comment:14 Changed 9 years ago by jadams

I believe you're correct. I think the file you're getting out of SVN is somehow different from the one I checked in, probably because SVN is doing some encoding translation wrong. You can set encodings in SVN, I'll need to figure out what it should be set to. Either that or try marking it as a binary file so nothing gets translated.

comment:15 Changed 9 years ago by robe

It should be latin1 which I think is ISO-8859-1. I'll try to resave the file as that on my end and see if that test passes.

comment:16 Changed 9 years ago by robe

I wonder if it just inherits the default of the OS (similar to what happens with the linefeed which is why I have to explicitly tag that). When I check in my editor it registers as Win1252.
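One caveat when reading the editor's report: for the letters in this test string, LATIN1 (ISO-8859-1) and Win1252 produce identical bytes, so an editor labelling the file as Win1252 does not by itself show that the bytes differ from latin1. A small Python sketch of that overlap follows (illustrative only; it uses Python's own `latin-1` and `cp1252` codecs and says nothing about what SVN did to the file):

```python
# The test string, written with escapes so the source stays ASCII.
text = "T\u00e5rneby in V\u00e5ler I Sol\u00f8r kommune"

latin1_bytes = text.encode("latin-1")
cp1252_bytes = text.encode("cp1252")

# a-ring (0xe5) and o-slash (0xf8) sit at the same positions in both.
same = latin1_bytes == cp1252_bytes

# The two code pages only disagree in the 0x80-0x9f range, e.g. the
# euro sign is 0x80 in Win1252 and has no LATIN1 encoding at all.
euro_in_cp1252 = "\u20ac".encode("cp1252")

print(same, euro_in_cp1252.hex())
```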

comment:17 Changed 9 years ago by robe

Well, it works if I correct the expected file and set its encoding to latin1. So it's the expected file itself that is the problem.

comment:18 Changed 9 years ago by jadams

Interesting. What did you do to the expected file, just set the encoding to latin1? Via svn:propset or what?

I'm not sure if MinGW has the "od" command, but if it does could you run it on both files?

od <expected>

od <_out>
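If "od" turns out not to be available, a byte-level comparison can be approximated in Python. The sketch below is a rough stand-in, not a reimplementation of od: `hex_dump` and `first_difference` are hypothetical helpers, and the two byte strings are made-up samples (the same text in LATIN1 and in UTF-8), not the real contents of the expected and output files.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset-prefixed rows of two-digit hex values."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:08x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines)

def first_difference(a: bytes, b: bytes) -> int:
    """Return the index of the first differing byte, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))

# Assumed sample data: one line of text in two different encodings.
sample_latin1 = "1|T\u00e5rneby\n".encode("latin-1")
sample_utf8 = "1|T\u00e5rneby\n".encode("utf-8")

print(hex_dump(sample_latin1))
print(hex_dump(sample_utf8))
print(first_difference(sample_latin1, sample_utf8))
```

To use it on the real files, the sample byte strings would be replaced by the result of reading each file in binary mode.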
