Opened 15 years ago
Closed 15 years ago
#467 closed patch (invalid)
shp2pgsql: command line option to cope with broken strings in dbf files
Reported by: | agnat | Owned by: | strk |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 1.5.2 |
Component: | postgis | Version: | 1.5.X |
Keywords: | shp2pgsql | Cc: |
Description
While struggling with a shapefile (.dbf) containing broken string encodings, I created the attached patch. It adds a command line option to set iconv's error policy:
shp2pgsql -C ignore broken.shp some_table
The above command drops broken characters.
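As a rough illustration of what an "ignore" error policy means, here is a minimal Python sketch. It uses Python's built-in codec error handlers rather than iconv, and the function name `to_utf8` and the sample bytes are hypothetical; it is not the code from the attached patch.

```python
def to_utf8(raw: bytes, source_encoding: str, policy: str = "strict") -> bytes:
    """Re-encode raw bytes to UTF-8.

    policy is a Python codec error handler: "strict" raises on bytes that
    are invalid in source_encoding, "ignore" silently drops them.
    """
    text = raw.decode(source_encoding, errors=policy)
    return text.encode("utf-8")


# Hypothetical field value: valid UTF-8 with one stray 0xFF byte in the middle.
broken = b"Bamako\xff Mali"

try:
    to_utf8(broken, "utf-8")  # strict: the conversion fails
except UnicodeDecodeError as exc:
    print("strict policy failed:", exc.reason)

# ignore: the stray byte is dropped and the rest is kept
print(to_utf8(broken, "utf-8", "ignore"))
```

The trade-off is the same as with the proposed option: dropping bytes lets the import finish, but the lost characters are not reported anywhere.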
Attachments (3)
Change History (12)
by , 15 years ago
Attachment: | shp2pgsql_iconv_error_policy.patch added |
---|
comment:1 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:2 by , 15 years ago
I've attached a version which applies to trunk and avoids useless allocation of a new string.
But… this patch won't help in the case I describe here: http://strk.keybit.net/blog/2010/04/04/unicode-from-osm-to-pgsql/ At least not with my version of iconv. Could you try with yours?
In the case described in the post the input contains truncated multibyte strings, and a simple policy for those (imho) would be to just discard the final partial substrings.
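The "discard the final partial substring" policy could be sketched as follows. This is a standalone Python illustration under the assumption that the input is UTF-8 that was cut off mid-character; the helper name `trim_partial_utf8_tail` is hypothetical and this is not how shp2pgsql implements it.

```python
def trim_partial_utf8_tail(raw: bytes) -> bytes:
    """Drop an incomplete multibyte UTF-8 sequence at the end of raw, if any."""
    # Walk back over at most 3 continuation bytes (bit pattern 10xxxxxx)
    # to find the byte that starts the last character.
    i = len(raw) - 1
    steps = 0
    while i >= 0 and steps < 3 and (raw[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i < 0:
        return raw  # empty or only continuation bytes: nothing to trim here

    # The lead byte tells how many bytes the character needs in total.
    lead = raw[i]
    if lead >= 0xF0:
        needed = 4
    elif lead >= 0xE0:
        needed = 3
    elif lead >= 0xC0:
        needed = 2
    else:
        needed = 1

    available = len(raw) - i
    # Cut the tail only when the last character is shorter than it should be.
    return raw[:i] if available < needed else raw


# "Zürich" cut after the first byte of the two-byte "ü".
truncated = "Z\u00fcrich".encode("utf-8")[:2]
print(trim_partial_utf8_tail(truncated))
```

Only the tail is inspected, so invalid bytes elsewhere in the string are left for the regular conversion step to handle.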
comment:3 by , 15 years ago
patch3 no longer exposes iconv's transliterate option on the command line. At first I thought it was a good idea to expose all error policies for maximum flexibility. On second thought, however, I cannot come up with a real use case for the transliterate policy: the target format is always UTF-8, and there is no need to transliterate to UTF-8.
If anybody has a use-case that requires transliteration we can easily add this option back.
Although strk already fixed things for broken UTF-8 input, this patch is still useful when dealing with non-UTF-8 data. For example, the gadm shapefiles contain garbled strings in windows-1252 encoding. To import these files and just ignore characters that don't have a UTF-8 representation, set the -C option.
comment:4 by , 15 years ago
Could you please name a specific shapefile from gadm so I can take a look?
comment:5 by , 15 years ago
I've read the description on the page and sent an email. I think they are wrong: you can indeed put UTF-8 in a .dbf without problems. We want to fix those shapefiles!
comment:6 by , 15 years ago
You could store anything <= 255 bytes in a string field. Other sources state that ASCII is the right encoding. However, if you say it's common practice to use UTF-8 in dbf/shapefiles (like the geofabrik guys do), I'm happy. I'm a geo-noob, you're a committer.
I used the world file (475MB) in my tests, but i guess it affects every file that uses non ascii characters (and that has not been fixed yet). I think the strings in the germany file are broken, too. But I have to verify that.
comment:7 by , 15 years ago
The Mali file is corrupted and a lot smaller.
$ shp2pgsql MLI_adm1.shp > /dev/null
Shapefile type: Polygon
Postgis type: MULTIPOLYGON[2]
Unable to convert field value "..." to UTF-8: iconv reports "Illegal byte sequence"
I was planning to do something about it myself !!!
Free software never stops surprising me