Opened 13 years ago

Closed 8 years ago

#689 closed defect (worksforme)

NaturalEarth Character Encodings

Reported by: darkblueb
Owned by: darkblueb
Priority: normal
Milestone: OSGeoLive10.0
Component: OSGeoLive
Keywords: natural earth, upstream
Cc: live-demo@…, warmerdam

Description

the NaturalEarth dataset is supplied as .shp files in Windows 1252 encoding

http://en.wikipedia.org/wiki/Windows-1252

a close ISO standard is

http://en.wikipedia.org/wiki/ISO_8859-1

It is desirable to simply hold all tables in postgres as Unicode UTF-8

http://en.wikipedia.org/wiki/UTF-8

Currently, the custom version of the NaturalEarth data set we use does a pretty good job of passing through character encodings, but on inspection of Vietnamese and Slovenian Admin1 (state) names, there are some problem characters. How well these characters are represented in other environments, from the .shp files, remains to be seen.

Change history (33)

comment:1 by hamish, 13 years ago

Keywords: natural earth upstream added

comment:2 by kalxas, 12 years ago

Is this still valid in 6.0?

comment:3 by darkblueb, 12 years ago

Owner: changed from live-demo@… to darkblueb

there has been no progress on this bug, so 6.0 will, almost certainly, show the errors.

There is an updated and improved Natural Earth data set in the pipeline on the Natural Earth side of the fence. When the next revision of Natural Earth data arrives, this bug will have to be re-examined.

comment:4 by darkblueb, 12 years ago

Milestone: OSGeoLive6.5

comment:5 by hamish, 12 years ago

Cc: live-demo@… added

comment:6 by nvkelso, 12 years ago

Windows 1252 is the default encoding for (most) SHP files and Natural Earth distributes in that encoding. Windows 1252 should not be confused with ISO_8859-1; they are different, and Natural Earth will never be distributed via that ISO standard.
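To make the difference concrete: the two encodings agree on every byte except the range 0x80 to 0x9F, where Windows-1252 holds printable characters and ISO 8859-1 holds C1 control codes. A minimal Python sketch follows; the sample bytes are an illustrative assumption, not taken from the Natural Earth files.

```python
# Illustrative bytes only: 0x8A is S-caron in Windows-1252.
raw = b"\x8akofja Loka"

as_cp1252 = raw.decode("cp1252")    # printable S-caron
as_latin1 = raw.decode("latin-1")   # U+008A, an invisible C1 control code

# Collect every byte value that the two codecs decode differently.
# Bytes that Windows-1252 leaves undefined are replaced with U+FFFD here.
differing = [
    b for b in range(256)
    if bytes([b]).decode("latin-1")
    != bytes([b]).decode("cp1252", errors="replace")
]

print(repr(as_cp1252))
print(repr(as_latin1))
print(f"{len(differing)} differing bytes: "
      f"0x{differing[0]:02X}..0x{differing[-1]:02X}")
```

Any name that uses a character from that range will therefore look wrong if the file is read as ISO 8859-1.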

The admin-1 are special as the native (master) GeoDB file is in UTF-8, but ArcMap does a poor mapping onto Windows-1252 when it exports to SHP.

However, now that OGR has bindings for GeoDB, we can export a UTF-8 version, alongside the Windows-1252 version.

FYI: Natural Earth 2.0 (forthcoming) includes PostGIS data dump and script for importing all these, with the proper OGR to turn them into UTF-8 prior to import.

This can be previewed here:

https://github.com/nvkelso/natural-earth-vector/tree/master/tools/make-web-mercator-900913-ready/

and

https://github.com/nvkelso/natural-earth-vector/blob/master/tools/make-web-mercator-900913-ready/zip-it.sh

where:

ogr2ogr \
    --config OGR_ENABLE_PARTIAL_REPROJECTION TRUE --config SHAPE_ENCODING WINDOWS-1252 \
    -t_srs "$P900913" -lco ENCODING=UTF-8 -clipsrc $EXTENT -segmentize 1 -skipfailures \
    $base.shp $shapefile

note the encoding bits.
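The encoding-related options of that command can be isolated in a short Python sketch. The helper name and the file names are hypothetical; only the `--config SHAPE_ENCODING` and `-lco ENCODING=` options come from the command above, and the reprojection and clipping options are left out on purpose.

```python
import shlex


def build_ogr2ogr_cmd(dst, src, src_encoding="WINDOWS-1252", dst_encoding="UTF-8"):
    """Hypothetical helper: argv for an encoding-only ogr2ogr conversion."""
    return [
        "ogr2ogr",
        # How the bytes in the source .dbf should be interpreted.
        "--config", "SHAPE_ENCODING", src_encoding,
        # Encoding to use when writing the output layer.
        "-lco", f"ENCODING={dst_encoding}",
        # ogr2ogr takes the destination first, then the source.
        dst, src,
    ]


cmd = build_ogr2ogr_cmd("out_utf8.shp", "in_cp1252.shp")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where GDAL is installed.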

hoping to have the 2.0 pushed by the 18th of this month, the topology has been a bitch to fix.

_n

in reply to:  6 comment:7 by hamish, 12 years ago

Replying to nvkelso:

hoping to have the 2.0 pushed by the 18th of this month, the topology has been a bitch to fix.

have you tried GRASS GIS's cleaning tools? (it is a topologically correct GIS)

:-)

Hamish

comment:8 by nvkelso, 12 years ago

The problem with GRASS, or any other GIS, is that when the topology is built, it will actually result in odd little sliver polygons and the like. It might seem like the topology is now clean, and it is mathematically, but it's not visually or from a longer term data attribute maintenance perspective.

The larger issue is that PostGIS's self-referentially "correct" version of topology on simple geometry features makes it so that data that can be operated on by OGR and ArcGIS can no longer even be read in read-only mode, let alone stored and used for simple geometry tests. If an application like Mapnik can figure out what the winding rule is for drawing the interior of a polygon, surely PostGIS and GEOS can figure this out.

This is a major faulty assumption at the core of FOSS4G that will keep biting new users in the ass every single day going into the future.

in reply to:  8 comment:9 by hamish, 12 years ago

Replying to nvkelso:

The problem with GRASS, or any other GIS, is that when the topology is built, it will actually result in odd little sliver polygons and the like.

that doesn't happen in GRASS actually.. (which is why I recommended it) It supports simple features, but its native model is different. Or to say, it doesn't happen in GRASS -if the imported data is clean-, and it will highlight areas and nodes which are not clean for you, so you can clean it up. (e.g. I seem to recall cleaning up an old iteration of the NE data for some topology errors around the Great Lakes)

It might seem like the topology is now clean, and it is mathematically, but it's not visually or from a longer term data attribute maintenance perspective.

the way GRASS's native vector engine does it is that the boundary between two abutting areas is a single line, not two overlapping lines. so you can, e.g., reproject that area cleanly, then export the result back into a multiple-polygon shapefile or whatever without errors due to floating point precision issues etc.

Sorry, I can't speak much about PostGIS and GEOS's vector models, as I'm not an expert in those.

best, Hamish

comment:10 by darkblueb, 12 years ago

here is a list of the NaturalEarth layers that we use on the disk now:

10m_admin_0_countries
10m_admin_1_states_provinces
10m_geography_marine_polys
10m_geography_regions_elevation_points
10m_geography_regions_points
10m_geography_regions_polys
10m_lakes
10m_land
10m_ocean
10m_populated_places_simple
10m_rivers_lake_centerlines
10m_urban_areas

comment:11 by darkblueb, 11 years ago

we may want to add an ISO-8859-1 locale to the XUbuntu environment in the install scripts

comment:12 by darkblueb, 11 years ago

Live 6.5a4:

  • locale en_US.iso88591 is now added at build time
  • natural_earth2 database in utf8 encoding is present

natural_earth2 is built in Live6.5a4 by importing selected 8859-1 shp files into a utf8 database using shp2pgsql -W LATIN1 (see load_gisdata.sh). The presence of the en_US.iso88591 locale means that a LATIN1 encoded postgres database may be created, if desirable.

on first inspection, Latin1 characters in natural_earth2 appear to be handled well. However, some characters such as s-caron appear to be present in the NaturalEarth2 data, but are defined in ISO 8859-2 (Latin2), and may not be handled correctly at present. Non-European cases, such as romanized Vietnamese, are yet to be investigated.
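The s-caron case can be checked directly in Python. This is a small sketch using the standard codecs; the helper name is hypothetical, and the codec list mirrors encodings mentioned in this ticket.

```python
S_CARON = "\u0161"  # LATIN SMALL LETTER S WITH CARON


def can_encode(text, codec):
    """Hypothetical helper: True if every character of text exists in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


support = {
    codec: can_encode(S_CARON, codec)
    for codec in ("latin-1", "iso8859_2", "iso8859_15", "cp1252", "utf-8")
}
for codec, ok in support.items():
    print(f"{codec:12} {'yes' if ok else 'no'}")

# The Windows-1252 byte for s-caron is 0x9A; read as LATIN1 it turns
# into the C1 control code U+009A instead of a letter.
misread = S_CARON.encode("cp1252").decode("latin-1")
print(repr(misread))
```

Only ISO 8859-1 lacks the character, which fits the observation that a LATIN1 import handles most names but not these.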

comment:13 by nvkelso, 11 years ago

Cool!

See earlier comment about iso8859-1 not being a perfect match for Windows-1252.

The European cases (e.g. s-caron), and maybe other funk, will work themselves out over time. However, the romanized Vietnamese doesn't show up well in Windows-1252 in the first place. This only affects the admin-1 (states, provinces) in 2.0 (may change in future versions).

Now that OGR has support for reading GeoDB files, it would be better to import from the GeoDB original, which is in UTF-8, into the PostGIS db directly rather than using the SHP, which loses the details on conversion.

comment:14 by kalxas, 11 years ago

Is the recent fix good enough for 6.5 release?

comment:15 by nvkelso, 11 years ago

Yes, let's go with this.

I'll have a new version of the Natural Earth 2.1 admin-1 in a few months that should provide a longer-term solution.

comment:16 by kalxas, 11 years ago

Keywords: 7.0 added; 4.5beta3 removed
Milestone: OSGeoLive6.5 → OSGeoLive7.0

comment:17 by kalxas, 11 years ago

What is our status on this? Can we close the ticket?

comment:18 by darkblueb, 11 years ago

This issue primarily affects the "admin 1" layer in Natural Earth (generally equivalent to political state or province). I have received a working copy of the next revision of NaturalEarth admin1 in the form of a shapefile, whose .cpg file identifies it as encoded in UTF-8. Tools such as gdal/ogr correctly recognize this code page directly from the shapefile.

A text field called name_local contains many admin1 titles in local script. The newer working table performs substantially better on Live 7. (note that greek and russian glyphs are what I would call "basic" international characters, with arabic and thai in a similar class, but found more rarely on machines in the West. All four of these work correctly right now on the Live.)

One path forward is to replace the current admin1 table completely with the in-progress one I have now. There are 42 fields in the current admin1, and 20 more than that in the new working copy, including temporary fields meant for development. The shipping admin1 has those kinds of development fields in it, also. It is unclear who or what may depend on the exact contents of the admin1 fields.

comment:19 by darkblueb, 11 years ago

I now have a cleaned admin1 w/62 fields per kelso, with russian corrections, german additions and one greek typo fixed, patched. Next task is to create the file set to replace the file set we use now.

comment:20 by darkblueb, 11 years ago

Now I have express permission from the NaturalEarth team for a pre3.0 Natural Earth. However, there are changes the NE team would like to see, that we have no time for.. and my first pass at integrating left at least one more encoding problem.. The database is declared as LATIN1 and we should move it completely to UTF-8. Sounds easy, but who knows where a small problem may occur? Only thorough testing would show, and the clock is running out.

Note: why do we have load_postgis.sh and load_gisdata.sh as two scripts? I have no idea..

At any rate, with everything else this close to completion, it does not look good for making this change. Attribute it to summer schedules of the NE team I suppose. Very likely will have to push to 7.5.

comment:21 by kalxas, 11 years ago

Milestone: OSGeoLive7.0 → OSGeoLive7.5

moved to 7.5 Thanks

comment:22 by darkblueb, 10 years ago

Preliminary for Live 7.9:

checked out 10m_physical, 10m_cultural from svn stable and devel

https://github.com/nvkelso/natural-earth-vector

  • sadly, the .gdb dirs are not readable.. due to pre ArcGIS v10 data ?
  • gathered Live 7.0 tables subset (listed above in this ticket)
  • import subset shps via shp2pgsql -W LATIN1, LATIN9, CP1252
  • compared admin_1 states name, name_local,name_alt from each import and Live7
  • compared export of name,name_local,name_alt with ogr2ogr from shp file, and psql COPY from imported database.

SUMMARY: I found that the contents and accuracy of admin_1_states are as good as, but not better than, Live 7.0, using any of those approaches. There are at least two broken admin1 names showing a 00 byte in the unicode (in valid UTF-8 a 00 byte only ever encodes the NUL character, never part of another character), as well as many hard-to-decipher cases due to character set availability on any given terminal and display environment quirks.

comment:23 by darkblueb, 10 years ago

Note that the Live7 natural_earth2 subset geometry was "cleaned" for validity using PostGIS ST_MakeValid().

GRASS v.clean() is also available but untested.

comment:24 by warmerdam, 10 years ago

Cc: warmerdam added

comment:25 by darkblueb, 10 years ago

looking for "zeroes", I found these in a particular terminal:

  262 | Å\u009EÉ\u0099ki                             | c3 85 c2 9e c3 89 c2 99 6b 69
  188 | AÄ\u009Fstafa                                | 41 c3 84 c2 9f 73 74 61 66 61
 4526 | Ä\u0090á»\u0093ng Bằng Sông Há»\u0093ng   | c3 84 c2 90 c3 a1 c2 bb c2 93 6e 67 20 42 c3 a1 c2 ba c2 b1 6e 67 20 53 c3 83 c2 b4 6e 67 20 48 c3 a1 c2 bb c2 93 6e 67
 4109 | Ã\u0087anakkale                              | c3 83 c2 87 61 6e 61 6b 6b 61 6c 65
 4139 | Ã\u0087orum                                  | c3 83 c2 87 6f 72 75 6d
 4157 | Ã\u0087ankiri                                | c3 83 c2 87 61 6e 6b 69 72 69

however, a quick lookup in UTF8 shows those chars in the block U+AC00 – U+D7AF Hangul Syllable

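The query :

-- helper: space-separated hex dump of each byte in a text value
CREATE or REPLACE FUNCTION py_hexdump ( in_str text)
  RETURNS text
AS $$
  if in_str is None or in_str == '':
    return ''

  # under plpythonu (Python 2) in_str is a byte string, so x is one byte
  resT = ' '.join(x.encode('hex') for x in in_str)
  return resT

$$ LANGUAGE plpythonu;

-- the function must exist before the select is run
select gid,name,py_hexdump(name) from prov_admin1_ne;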
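The `x.encode('hex')` idiom only exists in Python 2. For checking strings outside the database, a Python 3 equivalent might look like the sketch below; the function name and the `encoding` parameter are assumptions for illustration.

```python
def hexdump(text, encoding="utf-8"):
    """Return the encoded bytes of text as space-separated hex pairs."""
    if not text:
        return ""
    return " ".join(f"{byte:02x}" for byte in text.encode(encoding))


# A correctly stored C-cedilla is two bytes (c3 87) in UTF-8.
print(hexdump("\u00c7orum"))  # c3 87 6f 72 75 6d
```

Comparing this output with the dump from the table shows where a stored name carries more bytes than a clean UTF-8 encoding would.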

comment:26 by darkblueb, 10 years ago

one clarification on the previous comment: in UTF-8 the top few bits of each byte in a multi-byte character are reserved for the encoding itself, so the values shown are not Hangul code points but two-byte encoded characters.
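One plausible reading of the hex dumps in comment 25, not confirmed anywhere in this ticket, is that UTF-8 text was decoded as Latin-1 and then stored as UTF-8 a second time. Under that assumption the original names can be recovered by reversing the extra step, as in this Python sketch (the function name is hypothetical):

```python
def undo_double_encoding(hex_bytes):
    """Reverse a UTF-8 -> Latin-1 -> UTF-8 round trip, given a hex dump."""
    raw = bytes.fromhex(hex_bytes)       # bytes as stored in the table
    once = raw.decode("utf-8")           # the garbled string seen in psql
    # Each garbled character stands for one original byte, so mapping back
    # through Latin-1 restores the original UTF-8 byte sequence.
    return once.encode("latin-1").decode("utf-8")


# Hex dumps copied from gid 4109 and gid 188 in comment 25.
print(undo_double_encoding("c3 83 c2 87 61 6e 61 6b 6b 61 6c 65"))
print(undo_double_encoding("41 c3 84 c2 9f 73 74 61 66 61"))
```

If the assumption is wrong for a given row, the final decode raises `UnicodeDecodeError` or `UnicodeEncodeError`, which makes a failed repair visible.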

comment:27 by darkblueb, 10 years ago

The existing admin1 has only a very few weak spots.

In early January 2014, I built a complete web-based editor for a new, expanded admin1 set, pulled from nvkelso's github account. The new editor is online now, pending nvkelso's feedback.

I suggest we leave admin1 as it is, and broadcast via multiple communities that the new editor is available. If all goes well with the new admin1 layer, the original admin1 is left in place, and the new admin1 is an addition, with improved data.

comment:28 by hamish, 10 years ago

Keywords: 7.0 removed
Milestone: OSGeoLive7.9 → OSGeoLive8.0

comment:29 by hamish, 10 years ago

Milestone: OSGeoLive8.0 → OSGeoLive8.5

comment:30 by kalxas, 9 years ago

Milestone: OSGeoLive8.5 → OSGeoLive9.0

Ticket retargeted after milestone closed

comment:31 by kalxas, 9 years ago

Milestone: OSGeoLive9.0 → OSGeoLive9.5

Ticket retargeted after milestone closed

comment:32 by kalxas, 8 years ago

Milestone: OSGeoLive9.5 → OSGeoLive10.0

Ticket retargeted after milestone closed

comment:33 by darkblueb, 8 years ago

Resolution: worksforme
Status: new → closed