#50 closed defect (fixed)
g.copy segfaults (debian.gfoss.it package) [and latest SVN]
Reported by: | steko | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 6.4.0 |
Component: | Default | Version: | 6.3.0 RCs |
Keywords: | g.copy | Cc: | Niccolo Rigacci <niccolo@…> |
CPU: | Unspecified | Platform: | Unspecified |
Description
GRASS 6.3.0svn (snicolo):~ > g.copy rast=fracz,frecz Copy raster <fracz@PERMANENT> to current mapset as <frecz> Segmentation fault
on #grass frankie suggested this should be caused by optimization. However IMHO this is a bug: optimization shouldn't result in a segfault :-)
Attachments (2)
Change History (18)
comment:1 by , 17 years ago
Keywords: | g.copy added |
---|---|
Milestone: | 6.3.0 → 6.4.0 |
comment:2 by , 17 years ago
Yes, that package should have been compiled with a recent GCC 4.2.3, as per Debian sid default. AFAIK the default compiler for Lenny should be even GCC 4.3.
comment:3 by , 17 years ago
Summary: | g.copy segfaults (debian.gfoss.it package) → g.copy segfaults (debian.gfoss.it package) [and latest SVN] |
---|
Ouch. This bug has been biting me for a while now. How can optimization cause a segfault for g.copy rast=rast1,rast2
but not for g.copy vect=vect1,vect2
? This sounds like a nasty bug to me.
I have noticed this problem with the latest SVN on x86 and AMD platforms, using the standard compile options:
# options for opteron: export CFLAGS="-march=opteron -O2" export CXXFLAGS="-march=opteron -O2" #optimization for P4: export CFLAGS="-march=pentium4 -O2" export CXXFLAGS="-march=pentium4 -O2"
comment:4 by , 17 years ago
Some updates:
Glyn says that this can only really be debugged by looking at assembly. Since I cannot do that, another approach might be toggling various optimization flags in order to get an idea of what may be causing the segfault. Below are some details on my system / optimization flags / results.
GCC details:
gcc -v Using built-in specs. Target: i486-linux-gnu Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.1.3 --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu Thread model: posix gcc version 4.1.3 20070601 (prerelease) (Debian 4.1.2-12)
compile flags that cause a segfault: -O3 (probably segfault) -O2 (segfault) -O1 (or) -O (segfault) -O0 (works as expected)
The actual CFLAGS for no optimization were: export CFLAGS="-march=opteron -O0 -ggdb -Wall -Werror-implicit-function-declaration" ; export CXXFLAGS="-march=opteron -O0" ./configure ...
.
So it is probably one of these flags that is causing the optimization-related segfault:
-fauto-inc-dec -fcprop-registers -fdce -fdefer-pop -fdelayed-branch -fdse -fguess-branch-probability -fif-conversion2 -fif-conversion -finline-small-functions -fipa-pure-const -fipa-reference -fmerge-constants -fsplit-wide-types -ftree-ccp -ftree-ch -ftree-copyrename -ftree-dce -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-sra -ftree-ter -funit-at-a-time
by , 17 years ago
Attachment: | broken_g.copy_01_opt_strace.txt added |
---|
strace from g.copy rast=... with the -O1 CFLAG
by , 17 years ago
Attachment: | working_g.copy_strace.txt added |
---|
strace from g.copy rast=... with the -O0 CFLAG
follow-up: 6 comment:5 by , 17 years ago
I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1). It looks like the first instruction executed by the working copy has something to do with the 'cellhd' file:
access("/data/ogeen_lab_shared/grass/sfrec/rtk/cellhd", F_OK) = 0
This appears to be where the segfault is occuring in the broken version of g.copy-- maybe?
follow-up: 7 comment:6 by , 17 years ago
Replying to dylan:
I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1).
strace doesn't provide enough detail for this kind of problem. All that can be deduced from the backtraces is that the segfault occurs sometime after the "cell" files are closed (recursive_copy(), general/manage/lib/do_copy.c:152) but before the access() test for whether the cellhd file exists (Gmake_mapset_element(), lib/gis/mapset_msc.c:60, called from do_copy(), general/manage/lib/do_copy.c:42).
AFAICT, the segfault could be occurring in recursive_copy(), G_verbose(), Gmake_mapset_element() or do_copy().
A gdb C backtrace would be more useful, even if it isn't enough to pinpoint the problem. A C backtrace will still report the correct function; it just won't necessarily report the correct line number, or the actual state of any variables (it *might* get this information right, but as you can't tell whether the information is accurate, it doesn't necessarily help).
It might be useful to add some printf() statements to do_copy() to print out variable values. OTOH, adding those statements might just make the bug disappear.
Also, it would be useful to know whether the problem occurs when lib/gis is compiled with optimisation or when general/manage is compiled with optimisation.
comment:7 by , 17 years ago
Replying to glynn:
Replying to dylan:
I have attached two strace logs, one from the working version of g.copy (-O0), and one from the non-working version (-O1).
strace doesn't provide enough detail for this kind of problem. All that can be deduced from the backtraces is that the segfault occurs sometime after the "cell" files are closed (recursive_copy(), general/manage/lib/do_copy.c:152) but before the access() test for whether the cellhd file exists (Gmake_mapset_element(), lib/gis/mapset_msc.c:60, called from do_copy(), general/manage/lib/do_copy.c:42).
AFAICT, the segfault could be occurring in recursive_copy(), G_verbose(), Gmake_mapset_element() or do_copy().
A gdb C backtrace would be more useful, even if it isn't enough to pinpoint the problem. A C backtrace will still report the correct function; it just won't necessarily report the correct line number, or the actual state of any variables (it *might* get this information right, but as you can't tell whether the information is accurate, it doesn't necessarily help).
Thanks for the tips Glynn. Any further hints on how to do the above?
Also, it would be useful to know whether the problem occurs when lib/gis is compiled with optimisation or when general/manage is compiled with optimisation.
After the configure step, how can I disable optimization for specific parts of the source tree?
follow-up: 11 comment:8 by , 17 years ago
Hunting down this bug is not as easy. I once spent some days trying but without luck.(1)
I enabled/disabled all -O1 flags one-by-one, but none of optimisation flags cause segfault. It must be combination of them;(1)
Using valgrind on optimised code is not usefull;(1)
Changing G_lstat() with lstat() in general/manage/lib/do_copy.c fixes problem;(1)
Currently all segfault reporters where using GCC 4.x. I have not tested grass_trunk for some time on this bug, but when it was first reported, I was able to reproduce segfault on two 32bit Gentoo systems with GCC-4.x without stack protection. Glynn - try to disable stack protection before running g.copy. Also please explain how to get proper asm backtrace or any other variables You are interested into to add printf's, as when I used printf's, I was not able to spot any strange values.
comment:9 by , 17 years ago
fyi, debugging howto hints can be viewed/added here:
If GDB is not an option I have found that just adding really basic printfs can help to split the problem in half, then in half again, and so on until you isolate the line the problem is on.
printf("--> made it this far. fn():line xxxx\n");
the raster libs could use a few new permanent G_debug()s at key spots.
Hamish
comment:10 by , 17 years ago
Following Glynn's advice to try isolating the segfault to something in do_copy.c :
The last "make" should re-compile do_copy.c (with -O2) and re-link g.copy. If the problem is in do_copy.c, the resulting g.copy will segfault.
make -C general/manage clean make -C general/manage CFLAGS1='-g -O0' rm general/manage/lib/OBJ.*/do_copy.o rm dist.*/bin/g.copy make -C general/manage CFLAGS1='-g -O2'
This results in a segfault when executing g.copy rast=...
It looks like g.copy was made less UNIX-dependent at some point, with the introduction of recursive_copy replacing cp (?). Could an out of control recursion be broken by compiler optimization?
follow-ups: 12 13 comment:11 by , 17 years ago
Replying to marisn:
Changing G_lstat() with lstat() in general/manage/lib/do_copy.c fixes problem;(1)
Okay, I've installed gcc 4.1.2, can reproduce the segfault, and have identified the problem: LFS.
When --enable-largefile is used, libgis is built with -D_FILE_OFFSET_BITS=64. This causes the "struct stat" used by e.g. G_lstat() in lib/gis/paths.c to be the 64-bit version (which is 96 bytes), and lstat() to be an alias for lstat64(). But general/manage (and, in fact, most of GRASS) is built without that flag, so "struct stat" is the 32-bit version (which is only 88 bytes).
The end result is that "sb" in recursive_copy is only 88 bytes, but G_lstat() writes 96 bytes, overflowing the buffer by 8 bytes. This trashes the saved copy of the %esi register, which holds the "dirp" variable. When recursive_copy() returns, the trashed value is restored to %esi, meaning that "dirp" has been trashed, resulting in the segfault.
In practice, exactly what gets trashed depends upon what is above "sb" on the stack, which depends upon the compiler version and compilation switches. Right now, the only files which use G_stat() or G_lstat() are:
grass=> select * from obj_imp where symbol in ('G_stat', 'G_lstat') ; object | symbol ----------------------------------------------------+--------- general/manage/lib/OBJ.i686-pc-linux-gnu/do_copy.o | G_lstat lib/gis/OBJ.i686-pc-linux-gnu/mapset_msc.o | G_stat lib/gis/OBJ.i686-pc-linux-gnu/remove.o | G_lstat lib/gis/OBJ.i686-pc-linux-gnu/unix_socks.o | G_lstat lib/gis/OBJ.i686-pc-linux-gnu/user_config.o | G_lstat
Four of them are part of libgis, so they will use the correct "struct stat" size. The other is do_copy.c, which loses.
As for possible solutions: one option is to change G_[l]stat() to allocate the "struct stat" with e.g. G_malloc(). Another is to move recursive_copy() into libgis.
The problem with the latter option is that, so long as G_[l]stat() exist in their current form, any code which uses them but which has a different off_t size to libgis will likely get bitten by the same issue (and by the time that it actually gets discovered, we will probably have forgotten about all of this).
comment:12 by , 17 years ago
Replying to glynn:
As for possible solutions: one option is to change G_[l]stat() to allocate the "struct stat" with e.g. G_malloc().
Drat. That isn't entirely safe, as the fields of "struct stat" may be at different offsets depending up the _FILE_OFFSET_BITS value. For the specific case of recursive_copy() on Linux, we'll get away with it, as it's only using the st_mode field, and that doesn't move.
OTOH, for functions which only need to test for a directory, we could add e.g. G_is_directory(path) to libgis, which avoids having to deal with "struct stat" outside of libgis.
comment:13 by , 17 years ago
Replying to glynn:
As for possible solutions: ... move recursive_copy() into libgis.
This is the solution I have chosen; the function is called G_recursive_copy().
Please test g.copy, and close this bug when satisfied.
comment:14 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Seems to be fixed - g.copy doesn't fail anymore. Tested with GRASS trunk rev. 31009 on 32bit Gentoo.
This is a known problem and it manifests itself only when g.copy is compiled with optimisation and system does not have stack protection. Your GRASS was also compiled with GCC-4.x?
More info here: http://wald.intevation.org/tracker/index.php?func=detail&aid=431&group_id=21&atid=204