Opened 15 years ago

Closed 9 years ago

#2864 closed enhancement (fixed)

ogr2ogr, shp > kml: broken XML, because of wrong encoding declaration

Reported by: peifer
Owned by: warmerdam
Priority: normal
Milestone:
Component: OGR_SF
Version: 1.6.0
Severity: normal
Keywords:
Cc:

Description

I see from ticket #1494 that the issue is known. I am just raising it again:

I am trying to convert shp files to kml and end up with *a lot* of broken XML files. ogr2ogr doesn't seem to make any effort to detect the dbf file's character encoding: it just dumps the attribute values into the kml file and declares the result to be UTF-8 encoded, which is wrong in most cases, at least over here in Europe, where people like to stick a lot of accented characters into the attribute values.

I am trying to work around this issue with a quick and dirty shell script, which guesses the source encoding from the language driver identifier, i.e. byte 29 of the dbf file header, and then uses iconv to convert the characters into proper UTF-8 encoding.
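A minimal sketch of the first half of that workaround, reading byte 29 of the dbf header. The function name `dbf_ldid` is made up for illustration, and `od` is used here only as one common way to dump a single byte:

```shell
# Print the byte at offset 29 of a dbf file (the language driver ID)
# as two lowercase hex digits, e.g. "57".
# dbf_ldid is a hypothetical helper name, not part of GDAL/OGR.
dbf_ldid() {
	od -An -tx1 -j29 -N1 "$1" | tr -d ' \n'
}
```

Called as `dbf_ldid somefile.dbf`, it prints the ID that would then be looked up in a code page table.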

I am just wondering: couldn't this be built into the shapefile driver, or am I missing something? (I would guess that most likely the latter is the case.)

Change History (10)

comment:1 by warmerdam, 15 years ago

The correct solution starts in the shapefile driver, but .dbf encoding handling is rather involved, and there is no one with the skills and time prepared to work on this issue at this time.

comment:2 by peifer, 15 years ago (in reply to: comment:1)

Replying to warmerdam:

...and there is no one with the skills and time prepared to work on this issue at this time.

OK. I see that it would then make sense to invest a bit more in my dirty encoding-guessing shell script. The basic approach is: either I can make some sense of the dbf's byte 29, or I assume CP1252, which is a good guess these days.
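The "known ID, else CP1252" rule can be sketched as a small lookup. The helper name `ldid_to_codepage` is hypothetical, and only a handful of IDs are listed here; they are taken from the fuller table in comment:5:

```shell
# Map a language driver ID (two hex digits, any case) to an iconv
# code page name. Unknown IDs fall back to CP1252.
# ldid_to_codepage is a hypothetical helper name.
ldid_to_codepage() {
	case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
		01)          echo CP437 ;;
		02)          echo CP850 ;;
		03|57|58|59) echo CP1252 ;;
		c8)          echo CP1250 ;;
		c9)          echo CP1251 ;;
		*)           echo CP1252 ;;
	esac
}
```

For example, `ldid_to_codepage C8` prints `CP1250`, while an unset header byte such as `00` yields the `CP1252` default.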

Just in case someone would be interested in the results from converting ~1000 shp files, collected from all over Europe:

ogr2ogr, out of the box: ~50% well-formed XML, 0% valid XML

ogr2ogr, plus sh script: ~95% well-formed and valid XML

By the way: does the dbf driver, or some other piece of code, strip potential control chars, which would be illegal character data in the KML/GML files? The quite popular non-breaking space comes to mind...
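For reference, XML 1.0 allows only tab, line feed and carriage return among the characters below U+0020. The non-breaking space (U+00A0) is itself legal XML; the trouble is rather that a lone 0xA0 byte from a single-byte code page is not valid UTF-8. A rough, byte-wise sketch of stripping the forbidden control characters follows. The function name is an assumption, and nothing here claims that OGR does this:

```shell
# Remove the C0 control bytes that XML 1.0 forbids
# (everything below 0x20 except tab, LF and CR) from stdin.
# strip_xml_control_chars is a hypothetical helper name.
strip_xml_control_chars() {
	LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}
```

It is meant to sit in a pipeline, e.g. `strip_xml_control_chars < in.kml > out.kml`.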

comment:3 by warmerdam, 15 years ago

I am not aware of any logic in OGR that would alter string attribute characters in normal processing.

comment:4 by peifer, 15 years ago (in reply to: comment:3)

Replying to warmerdam:

I am not aware of any logic in OGR that would alter string attribute characters in normal processing.

It looks like <, > and & from my test.dbf file are properly escaped when they appear as character data in the KML file. This is good.

<SimpleData name="N7">&lt;</SimpleData>
<SimpleData name="N8">&gt;</SimpleData>
<SimpleData name="N9">&amp;</SimpleData>
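The escaping seen above can be reproduced outside OGR with a few `sed` substitutions. This is only an illustration of the rule (ampersand first, then the angle brackets), not OGR's own implementation, and `xml_escape` is a hypothetical name:

```shell
# Escape the three characters that must not appear literally in XML
# character data. The ampersand goes first, otherwise the entities
# produced for < and > would be escaped a second time.
# xml_escape is a hypothetical helper name.
xml_escape() {
	sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}
```

For instance, `printf '%s\n' 'a<b & c>d' | xml_escape` prints `a&lt;b &amp; c&gt;d`.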

comment:5 by peifer, 15 years ago

Here is what I am currently doing on top of ogr2ogr -f KML output in order to generate valid KML. This shell script is perhaps not too elegant, but it works in practice.

A hint from my testing: guessing the encoding from the dbf file's header byte 29 seems to be unreliable, as not all shapefile-generating applications set this byte to an appropriate value. Nevertheless, recoding through iconv -c works at least insofar as iconv then drops all characters it cannot convert, so the output is valid UTF-8.

#!/bin/bash
#
# Generate a valid KML file based on ogr2ogr conversion
# plus some dirty quick hacks for fixing typical errors
# Hermann, March 2009


# Check if we have a big shape file. Experience shows that the
# KML file is ~3-4 times the size of the shape file, and
# KML files > 100M might be too large for the user's PC
#
FILESIZE=$(stat -c%s "$1")	# GNU stat; prints the size in bytes

if [[ $FILESIZE -gt 30000000 ]]
	then
		printf "%s\n" "KML file can not be generated: The shape file is too large ($FILESIZE bytes)" >&2
		exit 1
fi


# Check if we have a prj file, which would be nice
#
if [[ -f "${1%shp}prj" || -f "${1%SHP}PRJ" ]]
	then
		t_srs="-t_srs EPSG:4326"
	else
		t_srs=
fi


# See if we have a dbf file and make a guess on its encoding, based on
# code pages listed in the ArcGIS v9, ArcPad Reference Guide
# http://downloads.esri.com/support/documentation/pad_/ArcPad_RefGuide_1105.pdf
#
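# Locate the dbf file, accepting lower- and upper-case extensions
DBF=
for candidate in "${1%shp}dbf" "${1%SHP}DBF"
	do
		[[ -f "$candidate" ]] && DBF=$candidate && break
done

if [[ -z "$DBF" ]]
	then
		# Default value
		FROM_CODE=ASCII
	else
		# Use Language Driver Identifiers (LGID), in dbf file header, byte 29
		LGID=$( hexdump -n1 -s29 -C "$DBF" | head -1 | awk '{print $2}' )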

		# Translate LGID into a code page
		FROM_CODE=$( awk -v LGID="$LGID" 'BEGIN {

		CP["0x01"] = "CP437"	# U.S. MS–DOS
		CP["0x02"] = "CP850"	# International MS–DOS
		CP["0x03"] = "CP1252"	# Windows ANSI
		CP["0x08"] = "CP865"	# Danish OEM
		CP["0x09"] = "CP437"	# Dutch OEM
		CP["0x0A"] = "CP850"	# Dutch OEM*
		CP["0x0B"] = "CP437"	# Finnish OEM
		CP["0x0D"] = "CP437"	# French OEM
		CP["0x0E"] = "CP850"	# French OEM*
		CP["0x0F"] = "CP437"	# German OEM
		CP["0x10"] = "CP850"	# German OEM*
		CP["0x11"] = "CP437"	# Italian OEM
		CP["0x12"] = "CP850"	# Italian OEM*
		CP["0x13"] = "CP932"	# Japanese Shift-JIS
		CP["0x14"] = "CP850"	# Spanish OEM*
		CP["0x15"] = "CP437"	# Swedish OEM
		CP["0x16"] = "CP850"	# Swedish OEM*
		CP["0x17"] = "CP865"	# Norwegian OEM
		CP["0x18"] = "CP437"	# Spanish OEM
		CP["0x19"] = "CP437"	# English OEM (Britain)
		CP["0x1A"] = "CP850"	# English OEM (Britain)*
		CP["0x1B"] = "CP437"	# English OEM (U.S.)
		CP["0x1C"] = "CP863"	# French OEM (Canada)
		CP["0x1D"] = "CP850"	# French OEM*
		CP["0x1F"] = "CP852"	# Czech OEM
		CP["0x22"] = "CP852"	# Hungarian OEM
		CP["0x23"] = "CP852"	# Polish OEM
		CP["0x24"] = "CP860"	# Portuguese OEM
		CP["0x25"] = "CP850"	# Portuguese OEM*
		CP["0x26"] = "CP866"	# Russian OEM
		CP["0x37"] = "CP850"	# English OEM (U.S.)*
		CP["0x40"] = "CP852"	# Romanian OEM
		CP["0x4D"] = "CP936"	# Chinese GBK (PRC)
		CP["0x4E"] = "CP949"	# Korean (ANSI/OEM)
		CP["0x4F"] = "CP950"	# Chinese Big5 (Taiwan)
		CP["0x50"] = "CP874"	# Thai (ANSI/OEM)
		CP["0x57"] = "CP1252"	# ANSI
		CP["0x58"] = "CP1252"	# Western European ANSI
		CP["0x59"] = "CP1252"	# Spanish ANSI
		CP["0x64"] = "CP852"	# Eastern European MS–DOS
		CP["0x65"] = "CP866"	# Russian MS–DOS
		CP["0x66"] = "CP865"	# Nordic MS–DOS
		CP["0x67"] = "CP861"	# Icelandic MS–DOS
		CP["0x6A"] = "CP737"	# Greek MS–DOS (437G)
		CP["0x6B"] = "CP857"	# Turkish MS–DOS
		CP["0x6C"] = "CP863"	# French–Canadian MS–DOS
		CP["0x78"] = "CP950"	# Taiwan Big 5
		CP["0x79"] = "CP949"	# Hangul (Wansung)
		CP["0x7A"] = "CP936"	# PRC GBK
		CP["0x7B"] = "CP932"	# Japanese Shift-JIS
		CP["0x7C"] = "CP874"	# Thai Windows/MS–DOS
		CP["0x86"] = "CP737"	# Greek OEM
		CP["0x87"] = "CP852"	# Slovenian OEM
		CP["0x88"] = "CP857"	# Turkish OEM
		CP["0xC8"] = "CP1250"	# Eastern European Windows
		CP["0xC9"] = "CP1251"	# Russian Windows
		CP["0xCA"] = "CP1254"	# Turkish Windows
		CP["0xCB"] = "CP1253"	# Greek Windows
		CP["0xCC"] = "CP1257"	# Baltic Windows

		LGID = "0x" toupper(LGID)

		# Use code page if available, default = CP1252 = Windows ANSI
		print LGID in CP ? CP[LGID] : "CP1252" }' )
fi


# Make ogr2ogr write to tmp file, transform to WGS84, if source SRS is known
#
ogr2ogr -f KML -skipfailures tmp.kml "$1"  $t_srs  2>tmp.err

# Use ogr2ogr exit code and decide what to do
#
if [[ $? == 0 ]]
	then
		# Remove Style elements from ogr2ogr output
		# in order to avoid schema validation errors
		grep -v Style tmp.kml |

		# Convert to UTF-8 encoding
		iconv -c -f "$FROM_CODE" -t UTF-8 |

		# Use Awk hack for fixing the order of Schema and Folder elements
		awk  '
			NR == 3	{ folder = orig = $0 ; sub("<Document>", "", folder) ; next }
			NR == 4	{ print ( /Schema/ ? "<Document>" : orig ) ORS $0 ; next }
			/^<\/Schema>/ { print $0 ORS folder ; next }
			{ print } '
	else
		# Return the error message on stderr and remember the failure
		cat tmp.err >&2
		status=1
fi

# Remove potential tmp file(s)
rm -f tmp.kml tmp.err

exit "${status:-0}"
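The recoding step of the script can be tried in isolation. A minimal sketch, with `recode_to_utf8` as a hypothetical wrapper name; it shows both the normal conversion and the way `-c` drops input that cannot be converted:

```shell
# Recode stdin from the given code page to UTF-8, dropping anything
# iconv cannot convert (that is what -c does).
# recode_to_utf8 is a hypothetical helper name.
recode_to_utf8() {
	iconv -c -f "$1" -t UTF-8
}
```

`printf 'caf\351\n' | recode_to_utf8 CP1252` prints `café` as proper UTF-8. Note that `-c` only hides bytes that are undefined in the assumed source code page: a wrong guess usually still produces valid UTF-8, just with the wrong characters, and iconv may report a non-zero exit status even though output was written.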

comment:6 by Even Rouault, 15 years ago

See #2971 that forces XML output to be ASCII if it is not valid UTF-8.

comment:7 by peifer, 15 years ago (in reply to: comment:6)

Replying to rouault:

See #2971 that forces XML output to be ASCII if it is not valid UTF-8.

I assume that what is described in ticket #2971 is some sort of a (temporary) quick fix? Reducing 100000+ Unicode characters to less than 100 ASCII characters is perhaps not the ultimate solution...

comment:8 by Even Rouault, 15 years ago

No, it's not the ultimate solution. Hopefully work will be done at some point in the shapefile driver so it can properly decode data from the DBF encoding to UTF-8. However #2971 is not totally useless, as it checks that people using the OGR API (and not directly ogr2ogr) don't feed XML based drivers with invalid data.
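As a rough shell illustration of the kind of check described for #2971 (this is not GDAL's code; the function name and the choice of `?` as replacement character are assumptions of this sketch): pass the data through if it is valid UTF-8, otherwise reduce it to ASCII.

```shell
# Pass stdin through unchanged if it is valid UTF-8; otherwise
# replace every non-ASCII byte with "?".
# ascii_if_not_utf8 is a hypothetical helper name.
ascii_if_not_utf8() {
	local tmp
	tmp=$(mktemp) || return 1
	cat > "$tmp"
	if iconv -f UTF-8 -t UTF-8 "$tmp" >/dev/null 2>&1; then
		cat "$tmp"
	else
		# [?*] repeats "?" to match the length of the first set
		LC_ALL=C tr '\200-\377' '[?*]' < "$tmp"
	fi
	rm -f "$tmp"
}
```

This makes the trade-off discussed above visible: the output is always safe to declare as UTF-8, but every accented character in a mis-encoded file is lost.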

comment:9 by Jukka Rahkonen, 9 years ago

Perhaps these schema errors have since been corrected, given that GDAL now carries the badge "OGC KML 2.2.0 (Official Reference Implementation)":

http://www.opengeospatial.org/resource/products/details/?pid=1218

comment:10 by Even Rouault, 9 years ago

Resolution: fixed
Status: new → closed

The issue is more about shapefile recoding, which has been implemented since then. Closing as fixed.
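With recoding handled in the shapefile driver, the source encoding can be stated up front instead of being patched afterwards. A hedged sketch, assuming a GDAL version whose shapefile driver honours the `SHAPE_ENCODING` configuration option (passed here through the environment). The wrapper name `shp_to_kml`, its CP1252 default and the `OGR2OGR` override variable are illustrative assumptions, not part of GDAL:

```shell
# Convert a shapefile to KML, telling the shapefile driver which
# encoding the dbf attributes are in.
# shp_to_kml and OGR2OGR are hypothetical names for this sketch;
# OGR2OGR only exists so that the binary can be swapped out.
shp_to_kml() {
	local src=$1 dst=$2 enc=${3:-CP1252}
	SHAPE_ENCODING="$enc" "${OGR2OGR:-ogr2ogr}" -f KML "$dst" "$src"
}
```

A call such as `shp_to_kml input.shp output.kml CP1250` would then override whatever the dbf header claims.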
