= Loading XML files that are not UTF-8 =

|| '''Date''' || 2011/10/07 ||
|| '''Contact(s)''' || Simon Pigot ||
|| '''Last edited''' || 2011/10/07 ||
|| '''Status''' || complete - ready to commit ||
|| '''Assigned to release''' || 2.7.x ||
|| '''Resources''' || Available ||
|| '''Ticket #''' || #612 ||

== Overview ==

XML files encoded in a character set other than UTF-8 will not load. This happens often because users paste content from MS documents, which contains WINDOWS-1252 characters, into XML files, making the content WINDOWS-1252 rather than UTF-8. This content should be converted to UTF-8 where possible and, more importantly, must not create issues in the rest of the processing stream, which almost always assumes UTF-8.

=== Proposal Type ===

 * '''Type''': Core Change
 * '''App''': !GeoNetwork
 * '''Module''': Jeeves

=== Voting History ===

 * Vote proposed by Simon on 2011/10/07, result was +2 (+1 from Emanuele and +1 from Francois)

----

== Motivations ==

GeoNetwork should be able to load and convert XML files that are encoded in character sets other than UTF-8. For example, loading a file with characters from the WINDOWS-1252 charset causes batch import to fail with a message like:

[[Image(failed-batch-import.png)]]

With detection and conversion enabled, these records load correctly. A warning message is written to the logs indicating that a character set other than UTF-8 was detected. An example of a record with characters converted to UTF-8 from WINDOWS-1252 is:

[[Image(metadata-with-converted-utf8.png)]]

== Proposal ==

The loadFile method of jeeves.utils.Xml needs to be modified to read the file as a stream of bytes, detect the character set and convert to UTF-8 as required. Character set detection is enabled by default and is controlled by the Java system property `jeeves.filecharsetdetectandconvert`.
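
The read-bytes, detect, convert flow described above can be illustrated with a minimal sketch. This is not the Jeeves implementation and it does not use juniversalchardet: to stay self-contained it swaps in a simpler technique, strict UTF-8 decoding with a fallback to WINDOWS-1252. The class and method names are hypothetical; only the property name and the `disabled` value come from this proposal.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the code in jeeves.utils.Xml.
public class CharsetFallbackSketch {

    // Property name taken from the proposal.
    public static final String PROP = "jeeves.filecharsetdetectandconvert";

    // Detection is treated as on unless the property is set to "disabled".
    public static boolean detectionEnabled() {
        return !"disabled".equalsIgnoreCase(System.getProperty(PROP, "enabled"));
    }

    // Decode raw file bytes. Strict UTF-8 is tried first; if the bytes are
    // not valid UTF-8 and detection is on, assume WINDOWS-1252 (an assumption
    // of this sketch, standing in for real charset detection).
    public static String decode(byte[] raw, boolean detect)
            throws CharacterCodingException {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            if (!detect) {
                throw e; // previous behaviour: the load fails
            }
            System.err.println("WARNING: input is not valid UTF-8, "
                    + "decoding as windows-1252");
            return new String(raw, Charset.forName("windows-1252"));
        }
    }

    // Re-encode the decoded text so downstream processing sees UTF-8 bytes.
    public static byte[] toUtf8(byte[] raw, boolean detect)
            throws CharacterCodingException {
        return decode(raw, detect).getBytes(StandardCharsets.UTF_8);
    }
}
```

A caller would pass `CharsetFallbackSketch.detectionEnabled()` as the `detect` argument. A real detector such as juniversalchardet can identify many more encodings than this two-way fallback.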
=== Backwards Compatibility Issues ===

None, because character set detection and conversion can be disabled on startup by setting the Java system property `jeeves.filecharsetdetectandconvert` to `disabled`, e.g. `export JAVA_OPTS="-Djeeves.filecharsetdetectandconvert=disabled"` if using Tomcat, or by editing bin/start-geonetwork.sh for Jetty.

=== New libraries added ===

juniversalchardet - character set detection jar

== Risks ==

None known.

== Participants ==

 * Simon Pigot