Opened 15 years ago

Last modified 11 years ago

#612 new defect

g.html2man: parsing leads to man page errors

Reported by: hamish
Owned by: grass-dev@…
Priority: normal
Milestone: 6.5.0
Component: Docs
Version: svn-develbranch6
Keywords: g.html2man, utf8
Cc:
CPU: All
Platform: Unspecified

Description

Hi,

tools/g.html2man has a number of parsing problems.

there are a few, like cairodriver.1, which happen to start lines with ".", which the man program parses incorrectly. e.g. cairodriver lists output formats, and the '.pn' of .png gets hijacked, so all those image types end up missing from the resulting man page.

another popular one is <OL><LI> becoming ..IP instead of .IP (e.g. pngdriver.1 just after "Example")

and yet another is g.parser.1 where #%multiple: gets eaten.

detailed list of errors is here: (scroll down to 'grass-doc') http://lintian.debian.org/maintainer/pkg-grass-devel@lists.alioth.debian.org.html#grass

Hamish

Change History (11)

in reply to:  description comment:1 by glynn, 15 years ago

Replying to hamish:

tools/g.html2man has a number of parsing problems.

there are a few, like cairodriver.1, which happen to start lines with ".", which the man program parses incorrectly. e.g. cairodriver lists output formats, and the '.pn' of .png gets hijacked, so all those image types end up missing from the resulting man page.

I've committed some fixes in r37386. Apart from escaping dots and single quotes at the beginning of a line, it doesn't remove leading whitespace from pre-formatted text and doesn't insert line breaks within .IP "..." (this last one only affected d.graph).
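For readers unfamiliar with the roff side of this, here is a minimal Python sketch of the leading-dot/quote escaping idea. It is not the code from r37386; the function name `protect_control_chars` is hypothetical, and it assumes the usual roff convention of prefixing such lines with the zero-width `\&` escape so they are not read as control lines.

```python
def protect_control_chars(text):
    """Prefix lines starting with '.' or "'" with the roff escape \\&.

    Hypothetical helper, not the actual g.html2man code: roff treats a
    line beginning with '.' or "'" as a request, so plain text such as
    ".png" has to be protected before it is written to the man page.
    """
    protected = []
    for line in text.split("\n"):
        if line.startswith((".", "'")):
            line = "\\&" + line
        protected.append(line)
    return "\n".join(protected)
```

This should only be applied to body text, not to lines that are meant to be roff requests (.IP, .SH and so on).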

another popular one is <OL><LI> becoming ..IP instead of .IP (e.g. pngdriver.1 just after "Example")

and yet another is g.parser.1 where #%multiple: gets eaten.

I can't reproduce these.

detailed list of errors is here: (scroll down to 'grass-doc') http://lintian.debian.org/maintainer/pkg-grass-devel@lists.alioth.debian.org.html#grass

Note that the "bad whatis" entries correspond to an HTML file which lacks a description in the NAME section. This generally only occurs with HTML files which aren't generated from --html-description.

comment:2 by hamish, 15 years ago

Also there are a zillion cases of -flag and --flag using '-', which is interpreted as a hyphen, not a minus sign; i.e. '-' must be quoted as '\-'. See:

http://lintian.debian.org/tags/hyphen-used-as-minus-sign.html

I looked, but I've got no idea how to backport this stuff to the perl version. does the perl version still need to be there in trunk?

Hamish

in reply to:  2 ; comment:3 by glynn, 15 years ago

Replying to hamish:

Also there are a zillion cases of -flag and --flag using '-', which is interpreted as a hyphen, not a minus sign; i.e. '-' must be quoted as '\-'. See:

http://lintian.debian.org/tags/hyphen-used-as-minus-sign.html

How is the script supposed to determine whether a '-' in the HTML is a minus or a hyphen? For now, I've changed it to convert all occurrences of '-' to '\-'.
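As an illustration of the blanket conversion described above, a sketch in Python could be as small as the following. The function name `escape_hyphens` is made up for this example and is not taken from g.html2man.py.

```python
def escape_hyphens(text):
    """Replace every '-' in body text with the roff minus escape '\\-'.

    Illustrative only: this makes no attempt to tell real hyphens
    (as in "well-known") from option dashes, so it over-escapes.
    """
    return text.replace("-", "\\-")
```

It needs to run on the plain text before any roff markup is added, otherwise existing escapes containing '-' would be mangled.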

I looked, but I've got no idea how to backport this stuff to the perl version. does the perl version still need to be there in trunk?

No.

in reply to:  3 comment:4 by hamish, 15 years ago

Replying to glynn:

How is the script supposed to determine whether a '-' in the HTML is a minus or a hyphen?

fwiw, lintian's perl detection goes like: http://ftp.de.debian.org/debian/pool/main/l/lintian/lintian_2.2.10.tar.gz

# Catch hyphens used as minus signs by looking for ones at the
# beginning of a word, but don't generate false positives on \s-1
# (small font), \*(-- (pod2man long dash), or things like \h'-1'.
    if ($line =~ /^(
                    ([^\.].*)?
                    [\s\'\"\`\(\[]
                    (?<! \\s | \*\( | \(- | \w\' )
                   )?
                    (--?\w+)/ox) {
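
For experimenting outside lintian, the pattern above can be ported to Python more or less verbatim. This is a rough sketch, not lintian code: the names `HYPHEN_AS_MINUS` and `find_unescaped_option` are hypothetical, and it relies on Python's `re` accepting a lookbehind whose alternatives all have the same width (two characters each here).

```python
import re

# Port of the lintian pattern quoted above (verbose mode, so the
# whitespace inside the pattern is ignored).
HYPHEN_AS_MINUS = re.compile(
    r"""^(
          ([^.].*)?                        # line must not start with '.'
          [\s'"`(\[]                       # character just before the dash
          (?<! \\s | \*\( | \(- | \w' )    # skip \h'-1' and similar escapes
        )?
        (--?\w+)                           # the suspicious -flag / --flag
    """,
    re.VERBOSE,
)


def find_unescaped_option(line):
    """Return the first '-flag' style word the pattern catches, or None."""
    match = HYPHEN_AS_MINUS.match(line)
    return match.group(3) if match else None
```

Called on a man page source line such as "run g.region -p now" it reports "-p", while "well-known" and already escaped text such as "use \-p" are left alone.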

For now, I've changed it to convert all occurrences of '-' to '\-'.

ok; cosmetic rendering errors are better than syntax ones I guess.

cheers, Hamish

comment:5 by hamish, 15 years ago

some fixes for non-module help pages (were causing 'mandb -c' whatis errors) in devbr6 and trunk in r37877, ..

hope for testing feedback before backporting to 6.4.

Hamish

in reply to:  5 ; comment:6 by glynn, 15 years ago

Replying to hamish:

some fixes for non-module help pages (were causing 'mandb -c' whatis errors) in devbr6 and trunk in r37877, ..

The changes are meaningless; g.html2man.py discards all comments.

To get a suitable whatis entry, the HTML file needs to include a <h2>NAME</h2> section containing the module name followed by a dash then the description. This is added automatically by --html-description, but non-module pages will need to have it added manually.
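
To make the expected shape concrete, here is a small Python sketch of turning the text of a NAME section into the man page NAME block that whatis reads. The helper names `split_name_text` and `name_section` are hypothetical; the real g.html2man.py works on parsed HTML rather than a plain string.

```python
import re


def split_name_text(text):
    """Split 'module - description' into (module, description).

    Returns None when the text does not have that shape, which is the
    case that ends up as a bad whatis entry.
    """
    match = re.match(r"\s*(\S+)\s+-\s+(.+?)\s*$", text, re.DOTALL)
    if match is None:
        return None
    name, description = match.groups()
    # Collapse line breaks and repeated spaces in the description.
    return name, " ".join(description.split())


def name_section(name, description):
    """Format the roff NAME section, using '\\-' for the dash."""
    return ".SH NAME\n%s \\- %s\n" % (name, description)
```

For example, the text "r.slope - Computes slope." splits into the pair ("r.slope", "Computes slope."), whereas a page whose NAME text has no dash yields None.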

in reply to:  6 ; comment:7 by hamish, 15 years ago

Replying to glynn:

The changes are meaningless;

a few qualifiers on that are appropriate: a) currently; b) just for the python version in gr7. (the perl version in all Gr versions now knows about it)

g.html2man.py discards all comments.

the solution I used in the perl version was to check for that meta tag before the comment stripping code.

To get a suitable whatis entry, the HTML file needs to include a <h2>NAME</h2> section containing the module name followed by a dash then the description. This is added automatically by --html-description, but non-module pages will need to have it added manually.

yeah, I looked at doing that first. But the <H2>NAME really wasn't appropriate for the intro and driver custom HTML pages I looked at and so I went with the meta-tag solution.

I couldn't see how to make that work with the python version (does HTMLParser.py strip out the comments before we can get our hands on them?), and so I left it for now.

Hamish

in reply to:  7 comment:8 by glynn, 15 years ago

Replying to hamish:

yeah, I looked at doing that first. But the <H2>NAME really wasn't appropriate for the intro and driver custom HTML pages I looked at

I'm not so sure.

and so I went with the meta-tag solution.

Actual <meta> tags would be a reasonable solution for any files which genuinely shouldn't have a NAME section, e.g.

<meta name="name" content="grass-dbf" scheme="GRASS">
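
A sketch of how such tags could be picked up, assuming Python 3's `html.parser` module (the script discussed here used the older `HTMLParser.py`); the class name `MetaCollector` and the helper `grass_meta` are made up for illustration:

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags marked scheme="GRASS" into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("scheme") == "GRASS" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")


def grass_meta(html):
    """Return {name: content} for the GRASS meta tags found in html."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

Anything inside an HTML comment goes to `handle_comment`, which does nothing by default, so a meta tag hidden in a comment is not picked up by this sketch.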

I couldn't see how to make that work with the python version (does HTMLParser.py strip out the comments before we can get our hands on them?), and so I left it for now.

It's possible to add a handler for comments, but I don't consider this appropriate.

Comments are comments; you are supposed to be able to use them as you wish, without any consequences. The only situation where it's appropriate for an application to take note of comments in its input is if it intends to include them as comments in its output.

comment:9 by neteler, 11 years ago

Milestone: 6.5.0 → 7.0.0
Version: 6.4.0 RCs → svn-trunk

(since related to g.html2man.py, bumping to trunk)

comment:10 by hamish, 11 years ago

Milestone: 7.0.0 → 6.5.0
Version: svn-trunk → svn-develbranch6

Actually the situation with the (rewritten) .py version is better; this bug has to do with the Perl version in 6.x.

see

http://lintian.debian.org/maintainer/pkg-grass-devel@lists.alioth.debian.org.html#grass

scroll down to the "grass-doc" package and see the many warnings about manpage-has-bad-whatis-entry and manpage-has-errors-from-man.

Hamish

comment:11 by hamish, 11 years ago

Keywords: utf8 added

(G6.x only) re. man treating flag names as hyphens and breaking them for cut & paste when utf8 is used, here's some post-processing sed regex to catch many of them:

sed -i -e 's/\([ ([]\)-\([a-z]\)/\1\\-\2/g' \
       -e 's/\([ []\)--\([a-z]\)/\1\\-\\-\2/g' \
       -e 's/\[-\\fB/[\\-\\fB/' \
       -e 's/\[--\\fB/[\\-\\-\\fB/g' \
       -e 's/"\\fB-\([a-zA-Z0-9]\)/"\\fB\\-\1/' \
       -e 's/"\\fB--\([a-zA-Z0-9]\)/"\\fB\\-\\-\1/' \
       "$man_page"

Hamish
