Ticket #467 (closed patch: invalid)

Opened 2 years ago

Last modified 22 months ago

shp2pgsql: command line option to cope with broken strings in dbf files

Reported by: agnat Owned by: strk
Priority: medium Milestone: PostGIS 1.5.2
Component: postgis Version: 1.5.X
Keywords: shp2pgsql Cc:

Description

Struggling with a shape (dbf) file with broken string encodings, I created the attached patch. It adds a command line option to set iconv's error policy:

shp2pgsql -C ignore broken.shp some_table

The above command drops broken characters.
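
As a rough illustration of what an "ignore" error policy means in practice, here is a minimal Python sketch. It is not the patch and does not call iconv; Python's `errors="ignore"` decoder mode stands in for the behaviour described above. The helper name `to_utf8`, the sample bytes, and the assumption that the field is being read as UTF-8 are all made up for the example.

```python
def to_utf8(raw: bytes, source_encoding: str = "utf-8", policy: str = "strict") -> bytes:
    """Decode raw field bytes and re-encode them as UTF-8.

    policy is handed to Python's decoder: "strict" raises on a bad byte,
    "ignore" silently drops it (an analogy for the -C ignore behaviour).
    """
    return raw.decode(source_encoding, errors=policy).encode("utf-8")


# Hypothetical field value: "Café" written as Latin-1 but read as UTF-8.
broken = b"Caf\xe9"

try:
    to_utf8(broken)  # strict policy: conversion fails
except UnicodeDecodeError as exc:
    print("strict policy failed:", exc.reason)

# ignore policy: the offending byte is dropped, the rest survives
print("ignore policy kept:", to_utf8(broken, policy="ignore"))
```

Note that dropping bytes loses data, so this only makes sense when the broken characters really are unrecoverable.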

Attachments

shp2pgsql_iconv_error_policy.patch Download (4.7 KB) - added by agnat 2 years ago.
shp2pgsql_iconv_error_policy.patch2 Download (5.5 KB) - added by strk 23 months ago.
Cleaned up
shp2pgsql_iconv_error_policy.patch3 Download (5.5 KB) - added by agnat 22 months ago.
more clean-up

Change History

Changed 2 years ago by agnat

Changed 23 months ago by strk

  • owner changed from pramsey to strk
  • status changed from new to assigned

I was planning to do something about it myself !!!

Free software never stops surprising me :D

Changed 23 months ago by strk

Cleaned up

Changed 23 months ago by strk

I've attached a version which applies to trunk and avoids useless allocation of a new string.

But... this patch won't help in the case I describe here:  http://strk.keybit.net/blog/2010/04/04/unicode-from-osm-to-pgsql/ At least not with my version of iconv. Could you try with yours ?

In the case described in the post the input contains truncated multibyte strings, and a simple policy for those (imho) would be to just discard the final partial substrings.
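
The "discard the final partial substring" policy can be sketched in a few lines of Python. This is only an illustration of the idea, not code from any of the attached patches; the function name `drop_truncated_tail` and the sample string are hypothetical. It relies on the standard library's incremental UTF-8 decoder, which holds back an incomplete trailing sequence instead of raising when `final=False`.

```python
import codecs


def drop_truncated_tail(raw: bytes) -> str:
    """Decode UTF-8, discarding an incomplete multibyte sequence at the end.

    Invalid bytes elsewhere in the input still raise UnicodeDecodeError;
    only a cut-off trailing character is silently dropped.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    # final=False: an unfinished trailing sequence stays buffered in the
    # decoder and is never emitted.
    return decoder.decode(raw, final=False)


full = "Zürich".encode("utf-8")  # the "ü" takes two bytes in UTF-8
cut = full[:2]                   # truncated in the middle of "ü"
print(drop_truncated_tail(cut))
```

A field cut off by a fixed byte width would lose at most its last character this way.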

Changed 22 months ago by agnat

more clean-up

Changed 22 months ago by agnat

patch3 no longer exposes iconv's transliterate option on the command line. At first I thought it would be a good idea to expose all error policies for maximum flexibility. On second thought, however, I cannot come up with a real use-case for the transliterate policy, because the target format is always utf-8 and there really is no need to transliterate to utf-8.

If anybody has a use-case that requires transliteration we can easily add this option back.

Although strk already fixed things for broken utf-8 input, this patch is still useful when dealing with non-utf-8 data. For example, the gadm shapefiles contain garbled strings in windows-1252 encoding. To import these files and just ignore characters that don't have a utf-8 representation, set the -C option.

Changed 22 months ago by strk

Could you please point to a specific shapefile from gadm so I can take a look?

Changed 22 months ago by strk

I've read the description on the page and sent an email. I think they are wrong: you can put utf-8 in .dbf without problems. We want to fix those shapefiles!

Changed 22 months ago by agnat

You could store anything <=255 bytes in a string field. Other sources state that ASCII is the right encoding. However, if you say it's common practice to use utf-8 in dbf/shapefiles (like the geofabrik guys do), I'm happy. I'm a geo-noob, you're a committer.

I used the world file (475MB) in my tests, but I guess it affects every file that uses non-ASCII characters (and that has not been fixed yet). I think the strings in the germany file are broken, too. But I have to verify that.

Changed 22 months ago by agnat

The Mali file is corrupted and a lot smaller.

$ shp2pgsql MLI_adm1.shp > /dev/null
Shapefile type: Polygon
Postgis type: MULTIPOLYGON[2]
Unable to convert field value "..." to UTF-8: iconv reports "Illegal byte sequence"

Changed 22 months ago by strk

Have you tried passing -W Latin1 to the shp2pgsql call?

Changed 22 months ago by agnat

  • status changed from assigned to closed
  • resolution set to invalid

Ok, that works.
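
One way to see why declaring the source encoding helps, as a small Python sketch outside of shp2pgsql: the same bytes that are illegal as UTF-8 convert cleanly once they are read as Latin-1. The sample bytes are made up for illustration and are not taken from MLI_adm1.shp; `recode` is a hypothetical helper, and the sketch assumes the failing run was treating the field as UTF-8.

```python
def recode(raw: bytes, source_encoding: str) -> bytes:
    """Convert raw field bytes from source_encoding to UTF-8 (strict)."""
    return raw.decode(source_encoding).encode("utf-8")


# Illustrative value only: "Ségou" stored with a single Latin-1 byte for "é".
sample = b"S\xe9gou"

try:
    recode(sample, "utf-8")  # wrong assumption about the source encoding
except UnicodeDecodeError as exc:
    print("read as UTF-8:", exc.reason)

# With the right source encoding nothing has to be dropped.
print("read as Latin-1:", recode(sample, "latin-1"))
```

Unlike an ignore policy, this keeps every character, which is why no patch was needed here.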
