Opened 14 years ago

Closed 14 years ago

#467 closed patch (invalid)

shp2pgsql: command line option to cope with broken strings in dbf files

Reported by: agnat Owned by: strk
Priority: medium Milestone: PostGIS 1.5.2
Component: postgis Version: 1.5.X
Keywords: shp2pgsql Cc:

Description

While struggling with a shape (dbf) file containing broken string encodings, I created the attached patch. It adds a command line option to set iconv's error policy:

shp2pgsql -C ignore broken.shp some_table

The above command drops broken characters.
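As an illustration of what the "ignore" policy does under the hood (a simplified sketch, not the attached patch; the function name and buffer handling are made up for this example), the conversion can be wired to iconv by appending the GNU //IGNORE suffix to the target encoding and skipping any byte the library still rejects:

#include <iconv.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: convert 'src' from 'from_enc' to UTF-8, silently dropping
 * bytes that cannot be converted. Caller frees the returned string.
 * Note: //IGNORE is a GNU iconv extension. */
static char *utf8_convert_ignoring_errors(const char *src, const char *from_enc)
{
    char *out, *inptr, *outptr;
    size_t inlen, outlen;
    iconv_t cd = iconv_open("UTF-8//IGNORE", from_enc);

    if (cd == (iconv_t)-1) return NULL;

    inlen = strlen(src);
    outlen = inlen * 4;                /* worst-case expansion to UTF-8 */
    out = malloc(outlen + 1);
    if (!out) { iconv_close(cd); return NULL; }
    inptr = (char *)src;
    outptr = out;

    while (inlen > 0 &&
           iconv(cd, &inptr, &inlen, &outptr, &outlen) == (size_t)-1)
    {
        if (errno == EILSEQ || errno == EINVAL)
        {
            if (inlen == 0) break;
            inptr++; inlen--;          /* skip the offending byte and go on */
        }
        else
            break;                     /* E2BIG or other error: give up */
    }

    *outptr = '\0';
    iconv_close(cd);
    return out;
}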

Attachments (3)

shp2pgsql_iconv_error_policy.patch (4.7 KB ) - added by agnat 14 years ago.
shp2pgsql_iconv_error_policy.patch2 (5.5 KB ) - added by strk 14 years ago.
Cleaned up
shp2pgsql_iconv_error_policy.patch3 (5.5 KB ) - added by agnat 14 years ago.
more clean-up


Change History (12)

comment:1 by strk, 14 years ago

Owner: changed from pramsey to strk
Status: new → assigned

I was planning to do something about it myself!!!

Free software never stops surprising me :D

by strk, 14 years ago

Cleaned up

comment:2 by strk, 14 years ago

I've attached a version which applies to trunk and avoids useless allocation of a new string.

But… this patch won't help in the case I describe here: http://strk.keybit.net/blog/2010/04/04/unicode-from-osm-to-pgsql/ At least not with my version of iconv. Could you try with yours?

In the case described in the post, the input contains truncated multibyte strings, and a simple policy for those (imho) would be to just discard the final partial substrings.
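For reference, a minimal sketch of that policy (not part of any attached patch; the function name and the caller-provided output buffer are assumptions of this example): a UTF-8-to-UTF-8 pass through iconv stops with errno set to EINVAL at a truncated multibyte sequence, so everything up to the last complete character is kept and the partial tail is silently dropped.

#include <iconv.h>
#include <errno.h>
#include <string.h>

/* Sketch only: copy 'src' into 'out' (at least strlen(src)+1 bytes),
 * dropping a partial multibyte sequence at the very end of the input.
 * Returns the number of bytes written, or 0 on error. */
static size_t utf8_drop_trailing_partial(const char *src, char *out)
{
    size_t inlen = strlen(src), outlen = inlen;
    char *inptr = (char *)src, *outptr = out;
    iconv_t cd = iconv_open("UTF-8", "UTF-8");   /* pure validation pass */

    if (cd == (iconv_t)-1) return 0;

    /* iconv copies complete characters and stops with errno == EINVAL
     * when it hits a truncated sequence at the end of the input; that
     * tail is simply never written to 'out'. An invalid byte in the
     * middle (EILSEQ) would still need the "ignore" policy instead. */
    iconv(cd, &inptr, &inlen, &outptr, &outlen);

    *outptr = '\0';
    iconv_close(cd);
    return (size_t)(outptr - out);
}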

by agnat, 14 years ago

more clean-up

comment:3 by agnat, 14 years ago

patch3 no longer exposes iconv's transliterate option on the command line. At first I thought it would be a good idea to expose all error policies for maximum flexibility, but on second thought I cannot come up with a real use case for the transliterate policy: the target encoding is always UTF-8, which can represent any character, so there is nothing that would need transliterating.

If anybody has a use case that requires transliteration, we can easily add the option back.

Although strk already fixed things for broken UTF-8 input, this patch is still useful when dealing with non-UTF-8 data. For example, the GADM shapefiles contain garbled strings in windows-1252 encoding. To import these files and simply ignore characters that cannot be converted to UTF-8, pass -C ignore.
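For example (a hypothetical invocation: the file name is a placeholder, -C ignore is the switch proposed by this patch, and -W is the existing switch naming the source encoding):

shp2pgsql -W windows-1252 -C ignore some_gadm_file.shp some_table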

comment:4 by strk, 14 years ago

Could you please point me to a specific shapefile from GADM so I can take a look?

comment:5 by strk, 14 years ago

I've read the description on the page and sent them an email. I think they are wrong: you can indeed put UTF-8 in a .dbf without problems. We want to fix those shapefiles!

comment:6 by agnat, 14 years ago

You can store anything up to 255 bytes in a string field. Other sources state that ASCII is the right encoding. However, if you say it's common practice to use UTF-8 in dbf/shapefiles (like the Geofabrik guys do), I'm happy. I'm a geo noob, you're a committer.

I used the world file (475 MB) in my tests, but I guess the problem affects every file that uses non-ASCII characters (and that has not been fixed yet). I think the strings in the Germany file are broken, too, but I still have to verify that.

comment:7 by agnat, 14 years ago

The Mali file is corrupted and a lot smaller.

$ shp2pgsql MLI_adm1.shp > /dev/null
Shapefile type: Polygon
Postgis type: MULTIPOLYGON[2]
Unable to convert field value "..." to UTF-8: iconv reports "Illegal byte sequence"

comment:8 by strk, 14 years ago

Tried passing -W Latin1 to the shp2pgsql call?
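That is, something along the lines of (using the same file as above; the redirect just mirrors the earlier test):

shp2pgsql -W Latin1 MLI_adm1.shp > /dev/null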

comment:9 by agnat, 14 years ago

Resolution: invalid
Status: assigned → closed

Ok, that works.
