
Loading XML files that are not UTF-8

Date 2011/10/07
Contact(s) Simon Pigot
Last edited 2011/10/07
Status complete - ready to commit
Assigned to release 2.7.x
Resources Available
Ticket # #612

Overview

XML files that contain characters encoded in a character set other than UTF-8 will not load. This happens often because users paste content from MS documents into XML files; the pasted text contains WINDOWS-1252 characters, which makes the content WINDOWS-1252 rather than UTF-8. Such content should be converted to UTF-8 where possible and, more importantly, must not create issues in the rest of the processing stream, which almost always assumes UTF-8.

Proposal Type

  • Type: Core Change
  • App: GeoNetwork
  • Module: Jeeves

Voting History

  • Vote proposed by Simon on 2011/10/07, result was +2 (+1 from Emanuele and +1 from Francois)

Motivations

GeoNetwork should be able to load and convert XML files that contain characters from character sets other than UTF-8. For example, loading a file with characters from the WINDOWS-1252 charset causes batch import to fail with a message like:

With detection and conversion enabled, these records load correctly. A warning message is written to the logs indicating that a character set other than UTF-8 was detected. An example of a record with characters converted to UTF-8 from WINDOWS-1252 is:

Proposal

The loadFile method of jeeves.utils.Xml needs to be modified to read the file as a stream of bytes, detect the character set, and convert the content to UTF-8 as required. Character set detection is enabled by default and is controlled by the Java system property jeeves.filecharsetdetectandconvert.
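
The read, detect and convert steps can be illustrated with a minimal sketch. It is not the actual jeeves.utils.Xml code: the class and method names (CharsetSketch, decode, toUtf8, readAsUtf8) are hypothetical, and instead of a detector library it swaps in a simpler technique, a strict UTF-8 decode that falls back to WINDOWS-1252 when the bytes are not valid UTF-8. A real detector such as juniversalchardet would choose the charset from the bytes rather than assume a single fallback.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Jeeves.
final class CharsetSketch {
    // Assumed fallback charset; a detector library would pick this from the data.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    /** Decodes bytes as strict UTF-8, falling back to WINDOWS-1252 if they are not valid UTF-8. */
    static String decode(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Mirrors the warning described in the proposal; the wording is illustrative.
            System.err.println("WARNING: content is not valid UTF-8, converting from " + FALLBACK);
            return new String(bytes, FALLBACK);
        }
    }

    /** Returns the content re-encoded as UTF-8 bytes. */
    static byte[] toUtf8(byte[] bytes) {
        return decode(bytes).getBytes(StandardCharsets.UTF_8);
    }

    /** Reads a file as raw bytes and returns its content as UTF-8 bytes. */
    static byte[] readAsUtf8(Path file) throws IOException {
        return toUtf8(Files.readAllBytes(file));
    }
}
```

Valid UTF-8 input passes through unchanged, while the WINDOWS-1252 bytes 0x93 and 0x94 (the curly quotes typically pasted from MS documents) come out as U+201C and U+201D.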

Backwards Compatibility Issues

None, because character set detection and conversion can be disabled on startup by setting the Java system property jeeves.filecharsetdetectandconvert to disabled, e.g. export JAVA_OPTS="-Djeeves.filecharsetdetectandconvert=disabled" if using Tomcat, or by editing bin/start-geonetwork.sh for Jetty.
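
As a rough sketch of how such a switch could be read at runtime: the property name and the value disabled come from this proposal, while the class name DetectionSwitch, the default value "enabled" and the case-insensitive comparison are assumptions made for illustration.

```java
// Hypothetical helper for illustration only; not part of Jeeves.
final class DetectionSwitch {
    static final String PROPERTY = "jeeves.filecharsetdetectandconvert";

    /** Returns false only when the property is explicitly set to "disabled". */
    static boolean isEnabled() {
        // Assumption: an unset property or any other value leaves detection on.
        String value = System.getProperty(PROPERTY, "enabled");
        return !"disabled".equalsIgnoreCase(value.trim());
    }
}
```

With this shape, starting the JVM with -Djeeves.filecharsetdetectandconvert=disabled makes isEnabled() return false, and leaving the property unset keeps detection on.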

New libraries added

juniversalchardet - character set detection jar

Risks

None known.

Participants

  • Simon Pigot
Last modified on 10/16/11 23:43:50
