Opened 11 years ago

Closed 5 years ago

#2185 closed defect (fixed)

Painfully Slow 'v.in.ogr' Vector Import

Reported by: justinzane
Owned by: grass-dev@…
Priority: normal
Milestone: 7.0.7
Component: Vector
Version: svn-trunk
Keywords: import, OGR, performance, v.in.ogr
Cc: justinzane
CPU: x86-64
Platform: Linux

Description

I'm trying to import shapefiles (in this case, OpenStreetMap data) into GRASS (grass70-svn) using v.in.ogr; and it is painfully, brutally slow. By slow, I mean between 13 and 28 primitives per second.

At first I thought this was because I was using the default sqlite database on an NFS mount. But trying sqlite on a local disk and trying MySQL on a separate server both failed to improve performance. I'm getting a consistent read/write of ~125Kbps/~550Kbps no matter which shapefile or which database I use.

I cannot see any bottlenecks anywhere, as the two python /path/to/grass-gui processes are averaging 10% of one core combined, the network utilization is below 10% consistently, and the DB server is almost quiescent. To give an example of how painful this is:

     v.in.ogr dsn=/home/justin/downloads/osm_CA/places.shp
     WARNING: All available OGR layers will be imported into vector map <places>
     Check if OGR layer <places> contains polygons...
     Importing 7020 features (OGR layer <places>)...
     -----------------------------------------------------
     Building topology for vector map <places@OSM>...
     Registering primitives...
     7020 primitives registered
     7020 vertices registered
     Building areas...
     0 areas built
     0 isles built
     Attaching islands...
     Attaching centroids...
     Number of nodes: 0
     Number of primitives: 7020
     Number of points: 7020
     Number of lines: 0
     Number of boundaries: 0
     Number of centroids: 0
     Number of areas: 0
     Number of isles: 0
     (Fri Jan 31 19:01:08 2014) Command finished (4 min 38 sec)

Note that this may belong under Database.

Attachments (3)

world_AOI_latlong.geojson (570 bytes ) - added by neteler 7 years ago.
Single polygon geometry in Latlong
intersect.c.patch (1.3 KB ) - added by hcho 7 years ago.
lib/vector/Vlib/intersect.c.patch
intersect.c.patch2 (761 bytes ) - added by mmetz 7 years ago.
patch for Vect_segment_intersection()


Change History (46)

comment:1 by justinzane, 11 years ago

Cc: justinzane added

Running under perf several times, and using -Og -g builds of grass7 and gdal, it seems that somewhere between 30-40% of the execution time is happening in Python or the shell. That seems odd, but I'm ignorant...

Additionally, it seems that grass is sending a query for every write to the DB, synchronously, instead of doing the DB I/O in another thread. That seems a poor use of multicore systems.

in reply to:  description comment:2 by mmetz, 11 years ago

Replying to justinzane:

I'm trying to import shapefiles (in this case, OpenStreetMap data) into GRASS (grass70-svn) using v.in.ogr; and it is painfully, brutally slow. By slow, I mean between 13 and 28 primitives per second.

I cannot see any bottlenecks anywhere, as the two python /path/to/grass-gui processes are averaging 10% of one core combined, the network utilization is below 10% consistently, and the DB server is almost quiescent. To give an example of how painful this is:

     v.in.ogr dsn=/home/justin/downloads/osm_CA/places.shp                           
     WARNING: All available OGR layers will be imported into vector map <places>
     Check if OGR layer <places> contains polygons...
     Importing 7020 features (OGR layer <places>)...
     -----------------------------------------------------
     Building topology for vector map <places@OSM>...
     Registering primitives...
     7020 primitives registered
     7020 vertices registered
     Building areas...
     0 areas built
     0 isles built
     Attaching islands...
     Attaching centroids...
     Number of nodes: 0
     Number of primitives: 7020
     Number of points: 7020
     Number of lines: 0
     Number of boundaries: 0
     Number of centroids: 0
     Number of areas: 0
     Number of isles: 0
     (Fri Jan 31 19:01:08 2014) Command finished (4 min 38 sec)

On a standard laptop with default settings, the import of points is about 140 x faster. 7000 points from a shapefile should be imported in less than 2 seconds.
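As a quick sanity check on these figures, the quoted log reports 7020 features in 4 min 38 sec. A minimal sketch of the arithmetic follows; the 2-second value is the upper bound stated above, not a measurement.

```python
# Throughput implied by the import log quoted above.
features = 7020
elapsed_s = 4 * 60 + 38          # "Command finished (4 min 38 sec)"

rate = features / elapsed_s      # features per second on the slow setup
expected_s = 2                   # upper bound quoted above for ~7000 points
speedup = elapsed_s / expected_s

print(f"{rate:.1f} features/s, about {speedup:.0f}x slower than expected")
```

The result falls inside the reported range of 13 to 28 primitives per second and agrees with the "about 140 x" estimate.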

You would need to monitor the v.in.ogr process and the corresponding db driver process (sqlite for the default sqlite database in GRASS 7) instead of any Python processes and see if v.in.ogr uses close to 100% CPU. If not, sqlite should be very busy, but that should not happen because all changes to the database are committed at once after importing the primitives.

comment:3 by justinzane, 11 years ago

CPU: Unspecified → x86-64

Thanks for the response. I cannot post the dot graphs because of size restrictions, and resizing them makes them near unreadable.

I did another little experiment, and it seems that the problem is not with the database driver at all. With both MySQL and SQLite, it is the location of the 'grassdata' folder that is at issue. I had kept 'grassdata' on an NFS mount. Moving it to a local (btrfs) spinning disk meant that the import was many times faster (although this moron forgot to add 'time' before the command :( ).

I'm rebuilding an unstripped glibc and other dependencies to see exactly which calls may be problematic.

As far as monitoring the mysql and v.in.ogr processes, do you have any recommended way besides running grass like:

perf record -e cycles:u -o grass.prof -g -- grass64 -text /home/justin/maps/grassdata/osm_data/OSM; 
####
# In grass shell [zsh]:
# v.in.ogr dsn=./path/to/shape.shp output=shape
# exit
####
perf script -i grass.prof | gprof2dot -f perf  | dot -Tpng -o output.png; 
gwenview output.png

in reply to:  3 comment:4 by wenzeslaus, 11 years ago

Replying to justinzane:

Thanks for the response. I cannot post the dot graphs because of size restrictions, and resizing them makes them near unreadable.

Curiosity: Can you post the .dot file? Is it small enough, e.g. 7-zipped or after using gprof2dot's --node-thres or --edge-thres?

comment:5 by justinzane, 11 years ago

I'll work on it. I'm rebuilding a few things now [and getting wedding pics ready to send to my niece, so dcraw is hogging the cpu and net] so as soon as I can do a couple of comparisons for you I will. I'll post them to my webserver if they are too large, and just add the links.

comment:6 by justinzane, 11 years ago

Please see http://www.justinzane.com/grass-perf.html for some more detailed data. I'm greatly interested in guidance on narrowing down the bottleneck with "grassdata" on a NFS mount, but I may need some assistance in the process. If anyone is on IRC, I'd be more than happy to work there. Thanks.

in reply to:  6 comment:7 by mmetz, 11 years ago

Replying to justinzane:

I'm greatly interested in guidance on narrowing down the bottleneck with "grassdata" on a NFS mount, but I may need some assistance in the process.

I don't know much about perf, but I do have some experience with accessing a GRASS database over NFS (multiple clients accessing the same GRASS database over NFS at the same time). The problem seems to be not the GRASS + NFS combination, but the connection to the NFS server (should be at least a Gigabit connection and static IP addresses for the server and all its clients) and other traffic simultaneously going to the NFS server from other clients.

You could

  • check the IO traffic on the NFS server without using GRASS
  • copy some stuff, e.g. an unpacked Linux kernel, from the local disk over to the NFS mount and check the performance
  • investigate the actually used NFS mount options in /proc/mounts
  • try to tune NFS mount options, particularly rsize,wsize, bigger seems to be better, more in man nfs

comment:8 by justinzane, 11 years ago

[quote]check the IO traffic on the NFS server without using GRASS[/quote]
Did, using iotop while running v.in.ogr -- No noticeable IO besides my GRASS import, which is expected since it is a personal server.

[quote]copy some stuff, e.g. an unpacked Linux kernel, from the local disk over to the NFS mount and check the performance[/quote]
Works swimmingly for large files. I updated the post at [www.justinzane.com/grass-perf.html] to show a simplistic test. Basically, though my Dell laptop has a castrated NIC -- the 4 gigabit pins are not connected! -- NFS works at almost line speed. Throughput is not the problem, rather it seems to be excessive opens/syncs/similar that is the issue.

Using nfsstat -Z10, I get about 27 writes per second. Which seems to equate to one write per primitive! I'm a GIS novice, but it seems that writing for every primitive is not efficient from a programming perspective. Am I missing something?

[quote]investigate the actually used NFS mount options in /proc/mounts[/quote]
nfs4 rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=192.168.0.182,local_lock=none,addr=192.168.0.10 0 0

[quote]try to tune NFS mount options, particularly rsize,wsize, bigger seems to be better, more in man nfs[/quote]
With writes of a few bytes to a few kB, this does not matter.
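For anyone repeating this check, here is a minimal sketch in plain Python (nothing GRASS-specific; `parse_mount_options` is a made-up helper name) that splits the option string posted above so that rsize and wsize can be read off directly.

```python
def parse_mount_options(options: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "rw", "hard") map to True.
    """
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed

# Option string copied from the /proc/mounts entry quoted above.
opts = parse_mount_options(
    "rw,relatime,vers=4.0,rsize=262144,wsize=262144,namlen=255,hard,"
    "proto=tcp,port=0,timeo=600,retrans=2,sec=sys,"
    "clientaddr=192.168.0.182,local_lock=none,addr=192.168.0.10"
)
print(int(opts["rsize"]) // 1024, "KiB read size")
print(int(opts["wsize"]) // 1024, "KiB write size")
```

Both sizes come out at 256 KiB, far larger than the few-byte writes observed, which is consistent with the remark that tuning them does not help here.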

comment:9 by mmetz, 11 years ago

I did some tests with and without NFS, importing a shapefile with 50000 points and 19 attribute fields:

  • shapefile on local disk, grassdata on local disk: 12000 points / sec
  • shapefile on local disk, grassdata on NFS: 2200 points / sec
  • shapefile on NFS, grassdata on NFS: 2100 points / sec
  • shapefile on NFS, grassdata on local disk: 11000 points / sec

Here, writing to grassdata on NFS is also much slower, but always much faster than about 20 points / sec. The NFS server here is not at all a high-performance server, instead designed for power-saving.
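To put the four runs side by side, a small sketch of the slowdown relative to the all-local case. The rates are the ones listed above, and the 20 points / sec figure is the approximate rate from the original report.

```python
# Points per second from the four test runs listed above.
rates = {
    ("local", "local"): 12000,   # (shapefile location, grassdata location)
    ("local", "nfs"): 2200,
    ("nfs", "nfs"): 2100,
    ("nfs", "local"): 11000,
}

baseline = rates[("local", "local")]
for (shp, data), rate in rates.items():
    print(f"shapefile on {shp}, grassdata on {data}: "
          f"{baseline / rate:.1f}x slower than all-local")

# Rate from the ticket description: roughly 20 points / sec.
print(f"reporter's setup: about {baseline / 20:.0f}x slower")
```

In these runs, where grassdata lives matters far more than where the shapefile lives.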

Here, your minifstest gives:

NumChars: Local (s)   NFS (s)  NFS/Local
00000002:  000.0100, 000.0200, 002.0000
00000004:  000.0100, 000.0200, 002.0000
00000008:  000.0200, 000.0300, 001.5000
00000016:  000.0200, 000.0400, 002.0000
00000032:  000.0400, 000.0400, 001.0000
00000128:  000.1300, 000.1600, 001.2308
00000512:  000.5200, 000.5300, 001.0192

The penalty for doing lots of tiny open-write-close cycles on NFS is at most a factor of 2 here, not 7.
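The minifstest script itself is not attached to this ticket, so the following is only a rough sketch of the same idea: time many one-character open-write-close cycles in a given directory. The function name, file name and directory choices are assumptions.

```python
import os
import tempfile
import time

def time_tiny_writes(directory: str, num_chars: int) -> float:
    """Append num_chars single characters, one open-write-close cycle each."""
    path = os.path.join(directory, "tiny_write_test.txt")
    start = time.perf_counter()
    for _ in range(num_chars):
        with open(path, "a") as handle:   # reopen the file for every character
            handle.write("x")
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed

if __name__ == "__main__":
    # Hypothetical paths: replace nfs_dir with a directory on the NFS mount.
    local_dir = tempfile.gettempdir()
    nfs_dir = tempfile.gettempdir()
    for n in (2, 8, 32, 128, 512):
        local_s = time_tiny_writes(local_dir, n)
        nfs_s = time_tiny_writes(nfs_dir, n)
        print(f"{n:8d}: {local_s:.4f}, {nfs_s:.4f}, {nfs_s / local_s:.4f}")
```

With both directories pointing at the same file system the ratio should hover around 1. Pointing `nfs_dir` at the NFS mount gives a figure comparable to the NFS/Local column above.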

IMHO, something is strange with your NFS server, but I have no idea what.

comment:10 by justinzane, 11 years ago

I noticed that. There are a bunch of nfs/nfs4 perf events; but, they require root privileges when running perf. I'm trying to get advice (elsewhere) on how to properly use them without running GRASS as root. As soon as I am able to examine the NFS behaviour, I'll post the results. I'm also pulling the code so I can speak with less ignorance. :)

comment:11 by hamish, 11 years ago

Keywords: v.in.ogr added

@justinzane: if you get a few moments could you add a quick example on our wiki of how end users can run perf? tx.

http://grasswiki.osgeo.org/wiki/Bugs#Using_a_profiling_tool

@mmetz: some sample OSM data to test is available here:

http://download.osgeo.org/livedvd/data/osm/

I'm still not understanding if this trouble is CPU bound, local IO bound, or network bound. Some combo of top, iotop, iftop, and gkrellm would hopefully shed some light on that without having to push everything through a profiling tool. What does 'top -i' show about the v.in.ogr and sqlite processes and kernel %use state? Is it deadlocked? Spending all its time in the kernel? libogr..?

regards, Hamish

comment:12 by hamish, 11 years ago

btw, you might also try osm2pgsql to load the OSM data into a PostGIS Postgres DB, then live-connect to the PostGIS DB from GRASS 7 with v.external. For example:

https://trac.osgeo.org/osgeo/browser/livedvd/gisvm/trunk/bin/load_postgis.sh#L85

Otherwise, map coordinates are stored in $MAPSET/vector in GRASS's vector format, and only feature attributes are stored in the DB. Map features are connected to attributes via category number.

comment:13 by imincik1, 10 years ago

I am experiencing this issue too. It seems to be a problem with all modules that write simple vector geometries in a loop. I am going to investigate some sqlite tweaks [1].

[1] http://www.sqlite.org/faq.html#q19

comment:14 by imincik1, 10 years ago

In my situation, I have my whole home directory, including the GRASS data dir, on an NFS 4 mount.

  1. After a short investigation, it seems that setting PRAGMA synchronous=OFF on the SQLite db has no effect.
  2. What made a great difference was moving the '.tmp' directory from my mapset outside of NFS (by using a symlink). v.in.ogr was 8 times faster.

So my idea is: would it be possible to have an environment variable that defines a global alternative directory for temporary files? This could improve the situation for all modules that use temporary files (v.random, for example, does not use them).
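For reference, the SQLite-side tweak tried in item 1 looks roughly like the sketch below when done from Python's standard sqlite3 module, together with wrapping all inserts in a single transaction. The table and column names are made up for illustration, and this is not how the GRASS sqlite driver is implemented.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "sketch.db")  # hypothetical file
conn = sqlite3.connect(db_path)

# Trade durability for speed: do not wait for data to reach the disk.
conn.execute("PRAGMA synchronous=OFF")
sync_mode = conn.execute("PRAGMA synchronous").fetchone()[0]  # 0 means OFF

conn.execute("CREATE TABLE places (cat INTEGER PRIMARY KEY, name TEXT)")

# One transaction around all inserts, committed once at the end.
with conn:
    conn.executemany(
        "INSERT INTO places (cat, name) VALUES (?, ?)",
        ((cat, f"place_{cat}") for cat in range(1, 7021)),
    )

row_count = conn.execute("SELECT COUNT(*) FROM places").fetchone()[0]
print(sync_mode, row_count)
conn.close()
```

Since the attribute writes are reportedly committed in one go already, it is plausible that the pragma changes little and that the temporary files are the larger cost, as item 2 suggests.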

in reply to:  14 ; comment:15 by neteler, 10 years ago

Replying to imincik1: ...

  1. What was a great change was moving '.tmp' directory from my mapset outside of NFS (by using symlink). v.in.ogr was 8 times faster.

So my idea is, is it possible to have some environment variable, which could define global alternative directory for temporary files ? This could improve situation for all modules which are using temporary files (for example v.random is not using).

Can you please try this variable? http://grass.osgeo.org/grass70/manuals/variables.html

--> "TMPDIR, TEMP, TMP" (name depends on the operating system)

in reply to:  15 ; comment:16 by imincik1, 10 years ago

Replying to neteler:

Replying to imincik1: ...

  1. What was a great change was moving '.tmp' directory from my mapset outside of NFS (by using symlink). v.in.ogr was 8 times faster.

So my idea is, is it possible to have some environment variable, which could define global alternative directory for temporary files ? This could improve situation for all modules which are using temporary files (for example v.random is not using).

Can you please try this variable? http://grass.osgeo.org/grass70/manuals/variables.html

--> "TMPDIR, TEMP, TMP" (name depends on the operating system)

Thanks for this tip, I will try.

To be sure, I mean the temporary directory located in the mapset: $MAPSET/.tmp/$HOSTNAME/vector . There is already a variable called GRASS_VECTOR_TEMPORARY, but according to its description, it just enables or disables the temporary dir. I will play with both and report back.

in reply to:  16 comment:17 by martinl, 10 years ago

Replying to imincik1:

To be sure, I mean temporary directory located in mapset: $MAPSET/.tmp/$HOSTNAME/vector . There is already some variable called GRASS_VECTOR_TEMPORARY, but according description, it is just enabling and disabling temporary dir. I will play with both and report back.

Yes, if GRASS_VECTOR_TEMPORARY is defined, then a vector map is created in $MAPSET/.tmp/$HOSTNAME/vector.

comment:18 by imincik1, 10 years ago

TMPDIR, TEMP, TMP have no effect on the temporary directory path created by the vector library.

I did one more test, importing some vector data with v.in.ogr:

  1. temporary directory located on NFS = 16 seconds
  2. temporary directory located on NBD = 2 seconds
  3. temporary directory located in memory (tmpfs on /mnt/booster type tmpfs): 0.8 seconds !

It seems that being able to point the temporary directory created by the vector library at a faster file system would be beneficial for all users.

in reply to:  15 comment:19 by wenzeslaus, 10 years ago

Replying to neteler:

Replying to imincik1: ...

  1. What was a great change was moving '.tmp' directory from my mapset outside of NFS (by using symlink). v.in.ogr was 8 times faster.

So my idea is, is it possible to have some environment variable, which could define global alternative directory for temporary files ? This could improve situation for all modules which are using temporary files (for example v.random is not using).

Can you please try this variable? http://grass.osgeo.org/grass70/manuals/variables.html

--> "TMPDIR, TEMP, TMP" (name depends on the operating system)

The manual is not completely clear about what is driven by those, but it says that /tmp (or perhaps its platform-dependent equivalent, i.e. the system tmp dir) is the default, which suggests that it is a different directory than /path/to/mapset/.tmp (i.e. the GRASS-specific dir).

TMPDIR, TEMP, TMP
    [Various GRASS GIS commands and wxGUI]
    The default wxGUI temporary directory is chosen from a platform-dependent list,
    but the user can control the selection of this directory by setting one of the
    TMPDIR, TEMP or TMP environment variables. Hence the wxGUI uses $TMPDIR if it
    is set, then $TEMP, otherwise /tmp.

I'm not sure about the other things, but for wxGUI it seems that if your grassdata dir is on NFS it might be more advantageous if the tmp dir for the GUI were the system one, because it would/could work much faster (supposing the case of generating images for map display). On the other hand, if you are creating something really big, e.g. an animation with g.gui.animation, it might be better to use the bigger space (which might be the .tmp dir in grassdata on NFS).

in reply to:  18 ; comment:20 by martinl, 9 years ago

Replying to imincik1:

TMPDIR, TEMP, TMP has no effect on temporary directory path created by vector library.

Starting with r65348 when GRASS_TMPDIR_MAPSET is set to '0', the GRASS data TMPDIR is not $LOCATION/$MAPSET/.tmp/$HOSTNAME, but $TMPDIR/.tmp/$HOSTNAME.
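Read literally, the rule above amounts to the following sketch. It only mirrors the wording of this comment and is not the GRASS library code; the function name, the example paths and the "/tmp" fallback when TMPDIR is unset are assumptions.

```python
import posixpath

def grass_tmpdir(env: dict, location: str, mapset: str, hostname: str) -> str:
    """Return the temporary directory implied by the rule described above."""
    if env.get("GRASS_TMPDIR_MAPSET") == "0":
        base = env.get("TMPDIR", "/tmp")   # "/tmp" fallback is an assumption
    else:
        base = posixpath.join(location, mapset)
    return posixpath.join(base, ".tmp", hostname)

# Hypothetical paths and host name, for illustration only.
default_dir = grass_tmpdir({}, "/nfs/grassdata/osm_data", "OSM", "laptop")
fast_dir = grass_tmpdir(
    {"GRASS_TMPDIR_MAPSET": "0", "TMPDIR": "/mnt/booster"},
    "/nfs/grassdata/osm_data", "OSM", "laptop",
)
print(default_dir)
print(fast_dir)
```

With the variable set to '0' and TMPDIR pointing at a tmpfs mount, the temporary vector files would land off NFS, which is the case that was measured as fastest earlier in this ticket.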

in reply to:  20 comment:21 by wenzeslaus, 9 years ago

Replying to martinl:

Replying to imincik1:

TMPDIR, TEMP, TMP has no effect on temporary directory path created by vector library.

Starting with r65348 when GRASS_TMPDIR_MAPSET is set to '0', the GRASS data TMPDIR is not $LOCATION/$MAPSET/.tmp/$HOSTNAME, but $TMPDIR/.tmp/$HOSTNAME.

Sounds good. I'm curious how this will work for the mounted disc use case.

Just a few clarifications. I was trying to read the source code a few days ago and I think that there is no actual usage of the HOSTNAME variable; instead, some C functions are used to retrieve this value (which probably is the value of the HOSTNAME variable). Also, you say "GRASS data TMPDIR", but the actual TMPDIR is a variable pointing to GRASS's directory in the system temporary directory. Is this meant in a symbolic way, as $LOCATION and $MAPSET are, or am I missing something?

Do I understand correctly that you can change it anytime (inside GRASS session), so grass.py has to always clear both places? (Or it is even not worth attempting to care for this case in grass.py.)

comment:22 by martinl, 8 years ago

Milestone: 7.0.0 → 7.0.5

comment:23 by neteler, 8 years ago

Milestone: 7.0.5 → 7.0.6

comment:24 by neteler, 7 years ago

Not sure if this is the right ticket. Attached a GeoJSON geometry covering most of the world (digitized in a Web GIS in Web Mercator in a lousy way which is reflecting reality and reprojected to Latlong with ogr2ogr).

It takes "forever" on the "Breaking boundaries..." step while the geometry is a single polygon. I wonder why?

by neteler, 7 years ago

Attachment: world_AOI_latlong.geojson added

Single polygon geometry in Latlong

in reply to:  24 comment:25 by mmetz, 7 years ago

Replying to neteler:

Not sure if this is the right ticket. Attached a GeoJSON geometry covering most of the world (digitized in a Web GIS in Web Mercator in a lousy way which is reflecting reality and reprojected to Latlong with ogr2ogr).

It takes "forever" on the "Breaking boundaries..." step while the geometry is a single polygon. I wonder why?

With trunk, import happens instantly (0.08 seconds). What GRASS version are you using? Do you have changes in your local copy?

comment:26 by neteler, 7 years ago

Since it is a production system, I'm on 7.2.svn

in reply to:  26 comment:27 by mmetz, 7 years ago

Replying to neteler:

Since it is a production system, I'm on 7.2.svn

With 7.2, I get the same instant (0.08 seconds) import with

v.in.ogr in=world_AOI_latlong.geojson out=world_AOI_latlong

comment:28 by neteler, 7 years ago

Strange, here it is not fast at all:

GRASS 7.2.1svn (latlong_wgs84):~ > time -p v.in.ogr in=world_AOI_latlong.geojson out=world_AOI_latlong
Check if OGR layer <OGRGeoJSON> contains polygons...
 100%
WARNING: Width for column properties set to 255 (was not specified by OGR),
         some strings may be truncated!
Importing 1 features (OGR layer <OGRGeoJSON>)...
 100%
-----------------------------------------------------
Registering primitives...
One primitive registered
7 vertices registered
Number of nodes: 1
Number of primitives: 1
Number of points: 0
Number of lines: 0
Number of boundaries: 1
Number of centroids: 0
Number of areas: -
Number of isles: -
-----------------------------------------------------
Cleaning polygons
-----------------------------------------------------
Breaking polygons...
Breaking polygons (pass 1: select break points)...
 100%
Breaking polygons (pass 2: break at selected points)...
 100%
-----------------------------------------------------
Removing duplicates...
 100%
-----------------------------------------------------
Breaking boundaries...
98% ^C
real 373.04
user 371.13
sys 0.97

GRASS 7.2.1svn (latlong_wgs84):~ > g.version -g
version=7.2.1svn
date=2017
revision=r70769
build_date=2017-03-20
build_platform=x86_64-pc-linux-gnu
build_off_t_size=8

GRASS 7.2.1svn (latlong_wgs84):~ > cs2cs 
Rel. 4.9.2, 08 September 2015
usage: cs2cs [ -eEfIlrstvwW [args] ] [ +opts[=arg] ]
                   [+to [+opts[=arg] [ files ]

GRASS 7.2.1svn (latlong_wgs84):~ > gdal-config --version
2.1.2

db.connect -p
driver: sqlite
database: /home/mundialis/grassdata/latlong_wgs84/user1/sqlite/sqlite.db
...

"svn status" does not show a single local modification.

The disk is a SSD drive. The system is a Fedora 25 installation:

rpm -qa | sort | grep 'geos\|gdal\|proj'
gdal-2.1.2-5.fc25.x86_64
gdal-devel-2.1.2-5.fc25.x86_64
gdal-grass-2.1.2-1.fc25.x86_64
gdal-libs-2.1.2-5.fc25.x86_64
gdal-python-2.1.2-5.fc25.x86_64
geos-3.5.0-3.fc25.x86_64
geos-devel-3.5.0-3.fc25.x86_64
proj-4.9.2-2.fc24.x86_64
proj-devel-4.9.2-2.fc24.x86_64
proj-epsg-4.9.2-2.fc24.x86_64
proj-nad-4.9.2-2.fc24.x86_64
pyproj-1.9.5.1-3.fc25.x86_64

strace shows a lot of lseek() if that matters:

...
lseek(6, 814250, SEEK_SET)              = 814250
lseek(6, 814250, SEEK_SET)              = 814250
lseek(6, 814250, SEEK_SET)              = 814250
lseek(6, 814250, SEEK_SET)              = 814250
lseek(6, -138, SEEK_CUR)                = 814112
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814250, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3242) = 3242
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=814351, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3343) = 3343
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3205) = 3205
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 4096) = 175
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, 814388, SEEK_SET)              = 814388
lseek(6, -138, SEEK_CUR)                = 814250
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814388, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3380) = 3380
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=814489, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3481) = 3481
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3343) = 3343
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 4096) = 175
lseek(6, 814526, SEEK_SET)              = 814526
lseek(6, 814526, SEEK_SET)              = 814526
lseek(6, 814526, SEEK_SET)              = 814526
lseek(6, 814526, SEEK_SET)              = 814526
lseek(6, -138, SEEK_CUR)                = 814388
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814526, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3518) = 3518
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=814627, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3619) = 3619
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3481) = 3481
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 4096) = 175
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, 814664, SEEK_SET)              = 814664
lseek(6, -138, SEEK_CUR)                = 814526
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814664, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3656) = 3656
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=814765, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3757) = 3757
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3619) = 3619
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 4096) = 175
lseek(6, 814802, SEEK_SET)              = 814802
lseek(6, 814802, SEEK_SET)              = 814802
lseek(6, 814802, SEEK_SET)              = 814802
lseek(6, 814802, SEEK_SET)              = 814802
lseek(6, -138, SEEK_CUR)                = 814664
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814802, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3794) = 3794
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=814903, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3895) = 3895
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3757) = 3757
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 4096) = 175
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, 814940, SEEK_SET)              = 814940
lseek(6, -138, SEEK_CUR)                = 814802
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=814940, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3932) = 3932
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=815041, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 4033) = 4033
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 3895) = 3895
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 4096) = 175
lseek(6, 815078, SEEK_SET)              = 815078
lseek(6, 815078, SEEK_SET)              = 815078
lseek(6, 815078, SEEK_SET)              = 815078
lseek(6, 815078, SEEK_SET)              = 815078
lseek(6, -138, SEEK_CUR)                = 814940
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=815078, ...}) = 0
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 4070) = 4070
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1", 26) = 26
write(6, "ve@\357B\276I\231$e@\305\240g\372\204\344e\300\246n\300\212l]e\300\23_\363\0\254"..., 75) = 75
fstat(6, {st_mode=S_IFREG|0664, st_size=815179, ...}) = 0
lseek(6, 815104, SEEK_SET)              = 815104
read(6, "ve@\357B\276I\231$e@\305\240g\372\204\344e\300\246n\300\212l]e\300\23_\363\0\254"..., 75) = 75
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 37) = 37
lseek(6, 811008, SEEK_SET)              = 811008
read(6, "l\fK\300&U@\304p\327\354y1U\300L\205\334i\322\27U\300\23_\363\0\254\350T@\r"..., 4033) = 4033
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 4096) = 175
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, 815216, SEEK_SET)              = 815216
lseek(6, -138, SEEK_CUR)                = 815078
write(6, "\f", 1)                       = 1
fstat(6, {st_mode=S_IFREG|0664, st_size=815216, ...}) = 0
lseek(6, 815104, SEEK_SET)              = 815104
read(6, "ve@\357B\276I\231$e@\305\240g\372\204\344e\300\246n\300\212l]e\300\23_\363\0\254"..., 112) = 112
write(6, "\r\6\0\0\0\246n\300\212l]e\300\305\240g\372\204\344e\300d\216\25\215\1ve@\357B\276"..., 101) = 101
fstat(6, {st_mode=S_IFREG|0664, st_size=815317, ...}) = 0
lseek(6, 815104, SEEK_SET)              = 815104
read(6, "ve@\357B\276I\231$e@\305\240g\372\204\344e\300\246n\300\212l]e\300\23_\363\0\254"..., 213) = 213
write(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\24_\363\0\254\350T@\23_\363"..., 37) = 37
lseek(6, 815104, SEEK_SET)              = 815104
read(6, "ve@\357B\276I\231$e@\305\240g\372\204\344e\300\246n\300\212l]e\300\23_\363\0\254"..., 75) = 75
read(6, "\r\2\0\0\0\246n\300\212l]e\300\246n\300\212l]e\300\23_\363\0\254\350T@\24_\363"..., 4096) = 175
...

I'm a bit clueless... do you have uncommitted improvements? :-)

Can anyone else please test this tiny GeoJSON in a latlong location? Thanks.

comment:29 by hcho, 7 years ago

Markus, I'm having exactly the same issue. Takes forever at that stage...

comment:30 by pvanbosgeo, 7 years ago

Takes a long time on my computer as well, gets stuck at breaking boundaries. I am running 7.3.svn r70733 on Ubuntu 16.04. I don't normally have problems with importing much larger shapefiles.

by hcho, 7 years ago

Attachment: intersect.c.patch added

lib/vector/Vlib/intersect.c.patch

comment:31 by hcho, 7 years ago

Can you try this patch? The issue was that Vect_line_intersection would create breaks on first and last line vertices because there was no tolerance for comparing points. There are more such comparisons (potential issue?)...
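
To illustrate the kind of comparison being discussed, here is a minimal sketch in Python. It is not the code from intersect.c or from the attached patch; the names `same_point` and `is_interior_break` and the tolerance value are hypothetical and chosen only for illustration. It shows how a computed intersection point that differs from a line's last vertex by rounding noise is treated as a new break under exact comparison, but not when a small tolerance is allowed.

```python
def same_point(p, q, tol=0.0):
    """Return True if p and q differ by at most tol in both x and y."""
    return abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol


def is_interior_break(cross, line, tol=0.0):
    """Return True if the intersection point `cross` would split `line`,
    i.e. it does not coincide with the first or last vertex."""
    return not (same_point(cross, line[0], tol) or same_point(cross, line[-1], tol))


line = [(0.0, 0.0), (1.0, 1.0)]
# Hypothetical computed intersection: the last vertex plus rounding noise.
cross = (1.0 + 1e-15, 1.0)

exact = is_interior_break(cross, line)               # exact comparison
tolerant = is_interior_break(cross, line, tol=1e-12)  # assumed tolerance
print(exact, tolerant)
```

With exact comparison the point counts as a break inside the line; with the assumed tolerance it is recognised as the existing end vertex.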

comment:32 by hcho, 7 years ago

Not sure if this patch would address the original issue though.

comment:33 by neteler, 7 years ago

In addition, here are my compiler settings; maybe that makes a difference?

sh -x conf_grass7.sh 
+ INTEL='-march=native -std=gnu99 -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector -m64'
+ MYGCC=-fdiagnostics-color
+ MYCFLAGS='-Wall -fopenmp -lgomp -Ofast -fno-fast-math -march=core-avx-i -fno-common -fexceptions -march=native -std=gnu99 -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector -m64 MYGCC'
+ MYLDFLAGS='-Wl,--no-undefined -fopenmp -lgomp'
+ MYCXXFLAGS=-Ofast
+ CC=clang
+ CXX=clang++
+ CFLAGS=-O2
+ CXXFLAGS=-O2
+ LDFLAGS=-s
+ ./configure --with-cxx --enable-largefile --with-proj --with-proj-share=/usr/share/proj --with-gdal=/usr/bin/gdal-config --with-python --with-geos --with-liblas --with-sqlite --with-nls --with-blas --with-blas-includes=/usr/include/atlas-x86_64-base/ --with-lapack --with-lapack-includes=/usr/include/atlas-x86_64-base/ --with-cairo --with-cairo-ldflags=-lfontconfig --with-freetype --with-freetype-includes=/usr/include/freetype2 --with-wxwidgets=/usr/bin/wx-config --with-fftw --with-motif --with-postgres --with-netcdf --without-mysql --without-odbc --without-openmp --without-ffmpeg
+ tee config_log.txt
checking host system type... x86_64-pc-linux-gnu
checking for gcc... clang
checking whether the C compiler (clang -O2 -s) works... yes
checking whether the C compiler (clang -O2 -s) is a cross-compiler... no
checking whether we are using GNU C... yes
...

# CPU:
cat /proc/cpuinfo | grep 'model name'
model name	: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
...

clang -v
clang version 3.9.1 (tags/RELEASE_391/final)

I'll switch to gcc and test again.

comment:34 by neteler, 7 years ago

Wow. After full recompilation, with gcc it does NOT "hang":

## gcc production
INTEL="-march=native"
MYGCC="-fdiagnostics-color"
MYCFLAGS="-O2 $INTEL $MYGCC"
MYCXXFLAGS="-O2"
MYLDFLAGS="-Wl,--no-undefined"
MYLDFLAGS="-s $MYLDFLAGS"

# clang
#CC=clang CXX=clang++ CFLAGS="-O2" CXXFLAGS="-O2" LDFLAGS="-s" ./configure \

# gcc
LDFLAGS="$MYLDFLAGS" CFLAGS="$MYCFLAGS" CXXFLAGS="$MYCXXFLAGS" ./configure \
  --with-cxx \
  --enable-largefile \
...

Now:

GRASS 7.2.1svn (latlong_wgs84):~ > time -p v.in.ogr in=world_AOI_latlong.geojson out=world_AOI_latlong --o
...
-----------------------------------------------------
1 input polygons
Total area: 2.37675E+13 (2 areas)
Area without category: 2.04916E+11 (1 areas)
-----------------------------------------------------
Copying features...
 100%
Building topology for vector map <world_AOI_latlong@user1>...
Registering primitives...
3 primitives registered
19 vertices registered
Building areas...
 100%
2 areas built
One isle built
Attaching islands...
 100%
Attaching centroids...
 100%
Number of nodes: 1
Number of primitives: 3
Number of points: 0
Number of lines: 0
Number of boundaries: 2
Number of centroids: 1
Number of areas: 2
Number of isles: 1
real 0.09
user 0.04
sys 0.01

So, unfortunate clang compiler settings before or gcc-is-better-than-clang here?

comment:35 by hcho, 7 years ago

Markus, you mean it does not hang with or without my patch?

in reply to:  34 comment:36 by mmetz, 7 years ago

Replying to neteler:

Wow. After full recompilation, with gcc it does NOT "hang":

[...]

So, unfortunate clang compiler settings before or gcc-is-better-than-clang here?

No, the difference is -O2 which you used with gcc but not with clang. I could reproduce the problem by not using any optimization.

The problem is that Vect_break_lines() enters an infinite loop because it continuously generates two new lines that are again intersecting each other, always at the same locations.

The attached patch (hcho) helps, but as mentioned in comment:31 there are more comparisons and it does not seem to be a good idea to allow tolerance at some places but not at other places when comparing a cross with a vertex.

What also helps is using Vect_line_intersection2() instead of Vect_line_intersection(). Interestingly, Vect_line_intersection2() does not need the patch, even though that part of the code is identical.

comment:37 by hcho, 7 years ago

Yes, that's what I found too. Vect_line_intersection2 doesn't have this issue, but it still creates a single point intersection at the first vertex of the second new line (line ID 3) in the geojson example.

Line ID 1: original unbroken line

1st iteration
Line ID 2-4: new broken lines

2nd iteration
Line ID 5: identical to line 3
Line ID 6: start node of line 3

Lines 5 & 6 shouldn't be returned at all from the intersection routine, I think. The patch fixes this.
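
The iterations listed above can be mimicked with a toy model. This is only a sketch under strong simplifying assumptions, not the logic of Vect_break_lines(): lines are reduced to 1-D intervals, and `reported_hit`, `count_breaks` and `max_steps` are hypothetical names. It shows that if the intersection test reports a point that is already an end vertex of a piece, each break yields one piece identical to the input plus a zero-length piece, so the work queue never empties.

```python
from collections import deque


def reported_hit(piece, x, skip_endpoint_hits):
    """Toy intersection test on a 1-D piece (a, b): is x reported as a break point?"""
    a, b = piece
    if not (a <= x <= b):
        return False
    if skip_endpoint_hits and (x == a or x == b):
        return False  # x is already a vertex of this piece, nothing to break
    return True


def count_breaks(line, x, skip_endpoint_hits, max_steps=50):
    """Break pieces at x until no piece reports a hit or max_steps is reached.
    Returns the number of breaks performed."""
    queue = deque([line])
    steps = 0
    while queue and steps < max_steps:
        a, b = queue.popleft()
        if reported_hit((a, b), x, skip_endpoint_hits):
            steps += 1
            # If x equals a or b, one of these pieces is identical to the
            # piece just broken and the other has zero length (a single node).
            queue.append((a, x))
            queue.append((x, b))
    return steps


print(count_breaks((0.0, 2.0), 1.0, skip_endpoint_hits=True))
print(count_breaks((0.0, 2.0), 1.0, skip_endpoint_hits=False))
```

When endpoint hits are skipped, the toy loop stops after a single break; otherwise it only stops because of the `max_steps` cap.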

by mmetz, 7 years ago

Attachment: intersect.c.patch2 added

patch for Vect_segment_intersection()

in reply to:  37 comment:38 by mmetz, 7 years ago

Replying to hcho:

Yes, that's what I found too. Vect_line_intersection2 doesn't have this issue, but it still creates a single point intersection at the first vertex of the second new line (line ID 3) in the geojson example.

Line ID 1: original unbroken line

1st iteration
Line ID 2-4: new broken lines

2nd iteration
Line ID 5: identical to line 3
Line ID 6: start node of line 3

Lines 5 & 6 shouldn't be returned at all from the intersection routine, I think. The patch fixes this.

I found another issue in Vect_segment_intersection(): the order of the segments matters, i.e. the intersection point of a with b can be slightly different from the intersection point of b with a. The second attached patch fixes that and also avoids that infinite loop. I opt to apply both patches.
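
One common way to make an intersection result independent of argument order is to put both segments into a canonical order before doing any arithmetic. The following Python sketch illustrates that idea only; it is not the content of the attached patch, and `_canonical` and `intersect` are hypothetical names. Because both call orders reduce to the same sequence of floating-point operations, the returned point is bit-identical either way.

```python
def _canonical(seg):
    """Order a segment's two endpoints lexicographically by (x, y)."""
    a, b = seg
    return (a, b) if a <= b else (b, a)


def intersect(seg1, seg2):
    """Intersection point of two segments, or None if they are parallel
    or do not cross. The result does not depend on argument order."""
    # Sort endpoints within each segment, then sort the two segments,
    # so intersect(a, b) and intersect(b, a) run the same arithmetic.
    s1, s2 = sorted((_canonical(seg1), _canonical(seg2)))
    (x1, y1), (x2, y2) = s1
    (x3, y3), (x4, y4) = s2
    d = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if d == 0:
        return None  # parallel or collinear: not handled in this sketch
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / d
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / d
    if not (0 <= t <= 1 and 0 <= u <= 1):
        return None  # the supporting lines cross outside the segments
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))


a = ((0.0, 0.0), (3.0, 1.0))
b = ((0.0, 1.0), (2.0, -0.3))
print(intersect(a, b) == intersect(b, a))
```

Collinear overlapping segments are deliberately left out of this sketch; a real implementation has to treat them separately.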

in reply to:  35 comment:39 by neteler, 7 years ago

Replying to hcho:

Markus, you mean it does not hang with or without my patch?

Due to lack of time, I have only tried it without your patch so far.

in reply to:  37 comment:40 by mmetz, 7 years ago

Replying to hcho:

Yes, that's what I found too. Vect_line_intersection2 doesn't have this issue, but it still creates a single point intersection at the first vertex of the second new line (line ID 3) in the geojson example.

Line ID 1: original unbroken line

1st iteration
Line ID 2-4: new broken lines

2nd iteration
Line ID 5: identical to line 3
Line ID 6: start node of line 3

Lines 5 & 6 shouldn't be returned at all from the intersection routine, I think. The patch fixes this.

Such an infinite loop in Vect_break_lines() has been observed previously and fixed back then in trunk r55796, r55813, r55848. Apparently, these commits did not solve all issues. It is most important that the intersection point of the same two segments is always the same, no matter which segment is a and which is b.

Your patch might cause problems in special cases where an intersection point is not added to a line even though none of the line's vertices is identical to the intersection point. That is, the line would not be broken although it should have been. Therefore I would rather not apply your patch.

comment:41 by neteler, 7 years ago

Milestone: 7.0.6 → 7.0.7

comment:42 by martinl, 5 years ago

What is the status of this ticket?

in reply to:  42 comment:43 by mmetz, 5 years ago

Resolution: fixed
Status: new → closed

Replying to martinl:

What is the state of this ticket?

The original issue was caused by a buggy NFS version that made sqlite very slow. This will not be fixed in GRASS; the solution is to use a bug-free NFS version.

The second issue with an infinite loop when breaking lines has been fixed in trunk and relbr76.
