Opened 16 years ago

Closed 16 years ago

Last modified 16 years ago

#50 closed defect (fixed)

g.copy segfaults (debian.gfoss.it package) [and latest SVN]

Reported by: steko Owned by: grass-dev@…
Priority: major Milestone: 6.4.0
Component: Default Version: 6.3.0 RCs
Keywords: g.copy Cc: Niccolo Rigacci <niccolo@…>
CPU: Unspecified Platform: Unspecified

Description

GRASS 6.3.0svn (snicolo):~ > g.copy rast=fracz,frecz
Copy raster <fracz@PERMANENT> to current mapset as <frecz>
Segmentation fault

on #grass frankie suggested this should be caused by optimization. However IMHO this is a bug: optimization shouldn't result in a segfault :-)

Attachments (2)

broken_g.copy_01_opt_strace.txt (33.5 KB ) - added by dylan 16 years ago.
strace from g.copy rast=... with the -O1 CFLAG
working_g.copy_strace.txt (41.7 KB ) - added by dylan 16 years ago.
strace from g.copy rast=... with the -O0 CFLAG

Download all attachments as: .zip

Change History (18)

comment:1 by marisn, 16 years ago

Keywords: g.copy added
Milestone: 6.3.06.4.0

This is a known problem and it manifests itself only when g.copy is compiled with optimisation and system does not have stack protection. Your GRASS was also compiled with GCC-4.x?

More info here: http://wald.intevation.org/tracker/index.php?func=detail&aid=431&group_id=21&atid=204

comment:2 by steko, 16 years ago

Yes, that package should have been compiled with a recent GCC 4.2.3, as per Debian sid default. AFAIK the default compiler for Lenny should be even GCC 4.3.

comment:3 by dylan, 16 years ago

Summary: g.copy segfaults (debian.gfoss.it package)g.copy segfaults (debian.gfoss.it package) [and latest SVN]

Ouch. This bug has been biting me for a while now. How can optimization cause a segfault for g.copy rast=rast1,rast2 but not for g.copy vect=vect1,vect2 ? This sounds like a nasty bug to me.

I have noticed this problem with the latest SVN on x86 and AMD platforms, using the standard compile options:

# options for opteron:
export CFLAGS="-march=opteron -O2"
export CXXFLAGS="-march=opteron -O2"
#optimization for P4:
export CFLAGS="-march=pentium4 -O2"
export CXXFLAGS="-march=pentium4 -O2"

comment:4 by dylan, 16 years ago

Some updates:

Glyn says that this can only really be debugged by looking at assembly. Since I cannot do that, another approach might be toggling various optimization flags in order to get an idea of what may be causing the segfault. Below are some details on my system / optimization flags / results.

GCC details:

gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.1.3 --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)

compile flags that cause a segfault: -O3 (probably segfault) -O2 (segfault) -O1 (or) -O (segfault) -O0 (works as expected)

The actual CFLAGS for no optimization were: export CFLAGS="-march=opteron -O0 -ggdb -Wall -Werror-implicit-function-declaration" ; export CXXFLAGS="-march=opteron -O0" ./configure ... .

So it is probably one of these flags that is causing the optimization-related segfault:

-fauto-inc-dec 
          -fcprop-registers 
          -fdce 
          -fdefer-pop 
          -fdelayed-branch 
          -fdse 
          -fguess-branch-probability 
          -fif-conversion2 
          -fif-conversion 
          -finline-small-functions 
          -fipa-pure-const 
          -fipa-reference 
          -fmerge-constants
          -fsplit-wide-types 
          -ftree-ccp 
          -ftree-ch 
          -ftree-copyrename 
          -ftree-dce 
          -ftree-dominator-opts 
          -ftree-dse 
          -ftree-fre 
          -ftree-sra 
          -ftree-ter 
          -funit-at-a-time

by dylan, 16 years ago

strace from g.copy rast=... with the -O1 CFLAG

by dylan, 16 years ago

Attachment: working_g.copy_strace.txt added

strace from g.copy rast=... with the -O0 CFLAG

comment:5 by dylan, 16 years ago

I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1). It looks like the first instruction executed by the working copy has something to do with the 'cellhd' file:

access("/data/ogeen_lab_shared/grass/sfrec/rtk/cellhd", F_OK) = 0

This appears to be where the segfault is occuring in the broken version of g.copy-- maybe?

in reply to:  5 ; comment:6 by glynn, 16 years ago

Replying to dylan:

I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1).

strace doesn't provide enough detail for this kind of problem. All that can be deduced from the backtraces is that the segfault occurs sometime after the "cell" files are closed (recursive_copy(), general/manage/lib/do_copy.c:152) but before the access() test for whether the cellhd file exists (Gmake_mapset_element(), lib/gis/mapset_msc.c:60, called from do_copy(), general/manage/lib/do_copy.c:42).

AFAICT, the segfault could be occurring in recursive_copy(), G_verbose(), Gmake_mapset_element() or do_copy().

A gdb C backtrace would be more useful, even if it isn't enough to pinpoint the problem. A C backtrace will still report the correct function; it just won't necessarily report the correct line number, or the actual state of any variables (it *might* get this information right, but as you can't tell whether the information is accurate, it doesn't necessarily help).

It might be useful to add some printf() statements to do_copy() to print out variable values. OTOH, adding those statements might just make the bug disappear.

Also, it would be useful to know whether the problem occurs when lib/gis is compiled with optimisation or when general/manage is compiled with optimisation.

in reply to:  6 comment:7 by dylan, 16 years ago

Replying to glynn:

Replying to dylan:

I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1).

strace doesn't provide enough detail for this kind of problem. All that can be deduced from the backtraces is that the segfault occurs sometime after the "cell" files are closed (recursive_copy(), general/manage/lib/do_copy.c:152) but before the access() test for whether the cellhd file exists (Gmake_mapset_element(), lib/gis/mapset_msc.c:60, called from do_copy(), general/manage/lib/do_copy.c:42).

AFAICT, the segfault could be occurring in recursive_copy(), G_verbose(), Gmake_mapset_element() or do_copy().

A gdb C backtrace would be more useful, even if it isn't enough to pinpoint the problem. A C backtrace will still report the correct function; it just won't necessarily report the correct line number, or the actual state of any variables (it *might* get this information right, but as you can't tell whether the information is accurate, it doesn't necessarily help).

Thanks for the tips Glynn. Any further hints on how to do the above?

Also, it would be useful to know whether the problem occurs when lib/gis is compiled with optimisation or when general/manage is compiled with optimisation.

After the configure step, how can I disable optimization for specific parts of the source tree?

comment:8 by marisn, 16 years ago

Hunting down this bug is not as easy. I once spent some days trying but without luck.(1)
I enabled/disabled all -O1 flags one-by-one, but none of optimisation flags cause segfault. It must be combination of them;(1)
Using valgrind on optimised code is not usefull;(1)
Changing G_lstat() with lstat() in general/manage/lib/do_copy.c fixes problem;(1)
Currently all segfault reporters where using GCC 4.x. I have not tested grass_trunk for some time on this bug, but when it was first reported, I was able to reproduce segfault on two 32bit Gentoo systems with GCC-4.x without stack protection. Glynn - try to disable stack protection before running g.copy. Also please explain how to get proper asm backtrace or any other variables You are interested into to add printf's, as when I used printf's, I was not able to spot any strange values.

  1. http://wald.intevation.org/tracker/index.php?func=detail&aid=431&group_id=21&atid=204

comment:9 by hamish, 16 years ago

fyi, debugging howto hints can be viewed/added here:

http://grass.gdf-hannover.de/wiki/GRASS_Debugging

If GDB is not an option I have found that just adding really basic printfs can help to split the problem in half, then in half again, and so on until you isolate the line the problem is on.

printf("--> made it this far. fn():line xxxx\n");

the raster libs could use a few new permanent G_debug()s at key spots.

Hamish

comment:10 by dylan, 16 years ago

Following Glynn's advice to try isolating the segfault to something in do_copy.c :

The last "make" should re-compile do_copy.c (with -O2) and re-link g.copy. If the problem is in do_copy.c, the resulting g.copy will segfault.

make -C general/manage clean
make -C general/manage CFLAGS1='-g -O0'
rm general/manage/lib/OBJ.*/do_copy.o
rm dist.*/bin/g.copy
make -C general/manage CFLAGS1='-g -O2'

This results in a segfault when executing g.copy rast=...

It looks like g.copy was made less UNIX-dependent at some point, with the introduction of recursive_copy replacing cp (?). Could an out of control recursion be broken by compiler optimization?

in reply to:  8 ; comment:11 by glynn, 16 years ago

Replying to marisn:

Changing G_lstat() with lstat() in general/manage/lib/do_copy.c fixes problem;(1)

Okay, I've installed gcc 4.1.2, can reproduce the segfault, and have identified the problem: LFS.

When --enable-largefile is used, libgis is built with -D_FILE_OFFSET_BITS=64. This causes the "struct stat" used by e.g. G_lstat() in lib/gis/paths.c to be the 64-bit version (which is 96 bytes), and lstat() to be an alias for lstat64(). But general/manage (and, in fact, most of GRASS) is built without that flag, so "struct stat" is the 32-bit version (which is only 88 bytes).

The end result is that "sb" in recursive_copy is only 88 bytes, but G_lstat() writes 96 bytes, overflowing the buffer by 8 bytes. This trashes the saved copy of the %esi register, which holds the "dirp" variable. When recursive_copy() returns, the trashed value is restored to %esi, meaning that "dirp" has been trashed, resulting in the segfault.

In practice, exactly what gets trashed depends upon what is above "sb" on the stack, which depends upon the compiler version and compilation switches. Right now, the only files which use G_stat() or G_lstat() are:

grass=> select * from obj_imp where symbol in ('G_stat', 'G_lstat') ;
                       object                       | symbol  
----------------------------------------------------+---------
 general/manage/lib/OBJ.i686-pc-linux-gnu/do_copy.o | G_lstat
 lib/gis/OBJ.i686-pc-linux-gnu/mapset_msc.o         | G_stat
 lib/gis/OBJ.i686-pc-linux-gnu/remove.o             | G_lstat
 lib/gis/OBJ.i686-pc-linux-gnu/unix_socks.o         | G_lstat
 lib/gis/OBJ.i686-pc-linux-gnu/user_config.o        | G_lstat

Four of them are part of libgis, so they will use the correct "struct stat" size. The other is do_copy.c, which loses.

As for possible solutions: one option is to change G_[l]stat() to allocate the "struct stat" with e.g. G_malloc(). Another is to move recursive_copy() into libgis.

The problem with the latter option is that, so long as G_[l]stat() exist in their current form, any code which uses them but which has a different off_t size to libgis will likely get bitten by the same issue (and by the time that it actually gets discovered, we will probably have forgotten about all of this).

in reply to:  11 comment:12 by glynn, 16 years ago

Replying to glynn:

As for possible solutions: one option is to change G_[l]stat() to allocate the "struct stat" with e.g. G_malloc().

Drat. That isn't entirely safe, as the fields of "struct stat" may be at different offsets depending up the _FILE_OFFSET_BITS value. For the specific case of recursive_copy() on Linux, we'll get away with it, as it's only using the st_mode field, and that doesn't move.

OTOH, for functions which only need to test for a directory, we could add e.g. G_is_directory(path) to libgis, which avoids having to deal with "struct stat" outside of libgis.

in reply to:  11 comment:13 by glynn, 16 years ago

Replying to glynn:

As for possible solutions: ... move recursive_copy() into libgis.

This is the solution I have chosen; the function is called G_recursive_copy().

Please test g.copy, and close this bug when satisfied.

comment:14 by marisn, 16 years ago

Resolution: fixed
Status: newclosed

Seems to be fixed - g.copy doesn't fail anymore. Tested with GRASS trunk rev. 31009 on 32bit Gentoo.

comment:15 by neteler, 16 years ago

Should this be backported to 6.3.0 release branch?

Markus

Note: See TracTickets for help on using tickets.