Opened 8 years ago

Closed 7 years ago

#896 closed defect (fixed)

sphinx doc build is broken because of BOM

Reported by: fgdrf
Owned by: live-demo@…
Priority: minor
Milestone:
Component: OSGeoLive
Keywords: 6.0
Cc:

Description

Error in daily build log:

Sphinx error:
Unable to decode input data.  Tried the following encodings: 'UTF-8'.
(UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 1: invalid start byte)
make: *** [sphinxbuild] Error 1

The error is caused by a byte order mark (BOM). I assume it's coming from a Windows-based edit. More details at http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark

The issue came from sponsor.rst edits in Revision 7725

Change history (10)

comment:1 Changed 8 years ago by fgdrf

Just a hint, use Notepad++ on Windows OS and set the encoding to UTF-8 (no BOM) and voila, everything should work.

comment:2 Changed 8 years ago by hamish

the Byte Order Mark has been added and removed from the .csv lists of contributors for a while now.

I haven't really been sure if they should be there or not so only did a quick edit just before the last release to stop the table creation from breaking.

It's easy enough to open with vi and delete the first two chars in the file if needed. Converting UTF back to ISO-8859-1 isn't too bad either:

iconv -f UTF-8 -t ISO_8859-1 utf_file > iso_file
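Where iconv isn't available, roughly the same conversion can be sketched in Python. This is only an illustration: the function name transcode and its parameters are made up for this example, and a character with no equivalent in the target encoding raises an error instead of being dropped.

```python
def transcode(src, dst, from_enc="utf-8", to_enc="iso-8859-1"):
    """Re-encode a text file, roughly like: iconv -f FROM -t TO src > dst."""
    # newline="" keeps the line endings exactly as they are in the source file.
    with open(src, "r", encoding=from_enc, newline="") as fin:
        text = fin.read()
    # Raises UnicodeEncodeError if a character cannot be represented in to_enc.
    with open(dst, "w", encoding=to_enc, newline="") as fout:
        fout.write(text)
```

Swapping the two encoding arguments converts in the other direction, i.e. from ISO-8859-1 to UTF-8.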

Qs:

  • Should the BOM be there or not?
  • What files (if any) should be saved in UTF-8, and why? (ISO will not handle non-Western multibytes, but that doesn't necessitate that the English/Western pages also be in UTF)

this is out of my area of expertise, but the constant "last committer wins" back and forth of text file variants is, as we see here, causing problems.

any tips from the multi-lingual trenches?

thanks, Hamish

comment:3 Changed 8 years ago by fgdrf

First of all, the sphinx doc build is still broken. Nevertheless, I agree with you that we should harmonize all the file encodings.

Unless configured otherwise in conf.py, Sphinx assumes that all files are UTF-8 sources (http://sphinx.pocoo.org/rest.html#source-encoding). That covers both the documentation (rst) and included files (e.g. csv). The configuration value (source_encoding) isn't currently set, so we should follow the default and not mix encodings between languages, whether or not a given language strictly needs UTF-8.

Got the build working again with the following steps (the referenced csv files were at fault, not the "sponsors.rst" named in the error report):
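If we wanted to state that default explicitly, the setting would go in conf.py roughly as below. This is a sketch only: source_encoding is the Sphinx configuration value, but which values and defaults a given Sphinx release accepts should be checked against its documentation.

```python
# conf.py (fragment): state the source encoding explicitly instead of relying
# on the default. Assumption: every .rst source file is saved as UTF-8.
source_encoding = "utf-8"
```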

  1. perl -CD -pe 'tr/\x{feff}//d' contributors.csv > xx;mv xx contributors.rst
  2. Edited the csv file with vi, typed :set fileencoding=utf-8, then saved and closed.

We should leave this ticket open because of the mixed encodings (UTF-8, UTF-8 CRLF, "shell archive or script for antique kernel text", "ASCII text", "PARIX object not stripped", etc).

To analyze the docs, run this in a terminal:

find . -name "*.rst" -exec file {} + | grep -v target
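The file utility guesses at content types, which is where odd labels such as "PARIX object" come from. A narrower check that only asks about BOM, UTF-8 validity and CRLF can be sketched in Python. The names describe_encoding and survey are hypothetical, and skipping any path that contains "target" is only an approximation of the grep above.

```python
import codecs
from pathlib import Path


def describe_encoding(data: bytes) -> str:
    """Return a rough label for raw file contents."""
    labels = []
    if data.startswith(codecs.BOM_UTF8):
        labels.append("starts with UTF-8 BOM")
    else:
        try:
            data.decode("ascii")
            labels.append("ASCII")
        except UnicodeDecodeError:
            try:
                data.decode("utf-8")
                labels.append("UTF-8")
            except UnicodeDecodeError:
                labels.append("not valid UTF-8")
    if b"\r\n" in data:
        labels.append("CRLF")
    return ", ".join(labels)


def survey(root="."):
    """Yield (path, label) for every .rst file below root."""
    for path in sorted(Path(root).rglob("*.rst")):
        # Approximates `grep -v target`: skip anything under a "target" path.
        if "target" in str(path):
            continue
        yield path, describe_encoding(path.read_bytes())
```

Looping over survey(".") from the doc root and printing each pair gives one line per file, with the files that need attention labelled "starts with UTF-8 BOM" or "not valid UTF-8".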

comment:4 Changed 8 years ago by hamish

note all text documents are (ie should be) set with the svn "eol-style=native" svnprop, so CRLF newlines will be automatically dealt with for everyone in a seamless way.

can your perl line be adapted to work with sed?

Hamish

comment:5 Changed 7 years ago by kalxas

Keywords: 6.0 added
Priority: major → critical

documents are still not building on the disk after 6.0beta1

comment:6 Changed 7 years ago by kalxas

Priority: critical → blocker

Hi, is there any progress on this issue? I am marking this as a show stopper, since we cannot release without docs :)

comment:7 Changed 7 years ago by camerons

Priority: blocker → minor

This error with docs seems to have been fixed (for the moment) by someone before me. I can't reproduce the problem.

However the problem might be reintroduced by a manual edit with the wrong format. Until we work out an automated script to fix the issue, I'll leave this issue open, priority=minor.

comment:8 Changed 7 years ago by oscarfonts

FWIW I'm preprocessing the files before committing, to avoid this kind of problem.

The script is here: https://gist.github.com/2864567

It's using this sed regexp to get rid of BOM:

# Remove BOM
sed -i '1 s/\xef\xbb\xbf//' $DOC
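For setups without GNU sed, the same idea can be sketched in Python. This is not the code from the gist, just an illustration of the technique: the function name strip_leading_bom is made up here, and it only touches a BOM at the very start of the file.

```python
import codecs


def strip_leading_bom(path) -> bool:
    """Remove a UTF-8 BOM from the start of a file; return True if one was removed."""
    with open(path, "rb") as fh:
        data = fh.read()
    if not data.startswith(codecs.BOM_UTF8):
        # Nothing to do, leave the file untouched.
        return False
    with open(path, "wb") as fh:
        fh.write(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes means the rest of the file, including its line endings, is written back unchanged.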

comment:9 Changed 7 years ago by kalxas

I confirm the fix after testing against beta1.

comment:10 Changed 7 years ago by kalxas

Resolution: fixed
Status: new → closed

I think this is now fixed. Thanks.
