Opened 14 years ago

Closed 12 years ago

#393 closed defect (fixed)

shp2pgsql returns "fseek(-xxx) failed on DBF file." for large (>2GB) DBF files

Reported by: maximeguillaud Owned by: pramsey
Priority: medium Milestone: PostGIS 2.0.0
Component: utils/loader-dumper Version: 1.5.X
Keywords: shp2pgsql fseek failed large file Cc: arencambre

Description

Running shp2pgsql on large files fails when the .DBF is over 2^31 bytes (2 GB). It throws multiple messages such as "fseek(-2124469057) failed on DBF file." (with the quoted offset changing).
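
The negative number in the error message is consistent with a file offset above 2^31 being stored in a 32-bit signed variable. The sketch below is illustrative only: as_int32_offset and show_wrapped are hypothetical helpers, not shp2pgsql code, and the unwrapped offset 2170498239 is inferred by adding 2^32 to the quoted value rather than taken from the report.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: the value a 32-bit signed variable would hold
 * after an offset below 4 GiB is squeezed into it (two's complement). */
static int32_t as_int32_offset(uint64_t offset)
{
    assert(offset <= UINT32_MAX);          /* only models offsets < 2^32 */
    if (offset <= (uint64_t)INT32_MAX)
        return (int32_t)offset;            /* fits: value is unchanged */
    /* Too large: wraps around by 2^32 and becomes negative. */
    return (int32_t)((int64_t)offset - 4294967296LL);
}

/* Hypothetical helper: print the wrapped value the way the error does. */
static void show_wrapped(uint64_t offset)
{
    printf("offset %llu -> fseek(%ld)\n",
           (unsigned long long)offset, (long)as_int32_offset(offset));
}
```

Calling show_wrapped(2170498239ULL) prints "offset 2170498239 -> fseek(-2124469057)", matching the offset quoted above.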

An example of such large files is the europe_highway.shp in the archive at http://downloads.cloudmade.com/europe/europe.shapefiles.zip. These files are OpenStreetMap data.

The attached patch fixes the problem for me on a Linux/amd64 platform.

Attachments (3)

int_to_long.diff (730 bytes ) - added by maximeguillaud 14 years ago.
large_dbf.v1.patch (617 bytes ) - added by dfuhry2 12 years ago.
seeko.patch (830 bytes ) - added by pramsey 12 years ago.
How about this? Wonder what platforms fseeko does not exist on…


Change History (24)

by maximeguillaud, 14 years ago

Attachment: int_to_long.diff added

comment:1 by pramsey, 14 years ago

I think that using fseeko and off_t is a better approach, hopefully one that can be used even on 32-bit operating systems, if we are careful.

comment:2 by pramsey, 14 years ago

Milestone: PostGIS 1.4.2 → PostGIS 1.4.3

comment:3 by pramsey, 14 years ago

I've applied the patch as a temporary measure into the 1.5 branch at r5787. This still needs a clean resolution for trunk and going forward.

comment:4 by pramsey, 14 years ago

The fix for mapserver is lightweight and correct, worth copying.

http://trac.osgeo.org/mapserver/ticket/3514

comment:5 by arencambre, 13 years ago

Cc: arencambre added
Milestone: PostGIS 1.4.3
Version: 1.5.X

Also getting this with 1.5.2 on Windows 7 32 bit. Windows Explorer says my DBF is 2,147,484,046 bytes. Dividing by 1024 three times comes out to 2.0000003GB. Hmmm…

comment:6 by arencambre, 13 years ago

Milestone: PostGIS 1.5.3

Can I be optimistic? Seems like a simple fix.

comment:7 by strk, 13 years ago

Component: postgis → loader/dumper

comment:8 by robe, 13 years ago

Milestone: PostGIS 1.5.3 → PostGIS 2.0.0

comment:9 by arencambre, 13 years ago

http://bugzilla.maptools.org/show_bug.cgi?id=1463 just got marked as resolved, and it may be related to this.

comment:10 by dfuhry2, 12 years ago

Regarding previous patches, changing "int" to "long" has no effect on 32-bit Linux at least, since both are 4 bytes there (on 64-bit Linux, long is 8 bytes while int stays 4, which is why the first patch worked on amd64). One can write a simple C program to printf("sizeof(int): %zu, sizeof(long): %zu\n", sizeof(int), sizeof(long)) to verify.
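
A minimal version of such a check, for anyone who wants to run it on their own platform. The helper names print_type_sizes and long_can_address_large_files are hypothetical; off_t is added to the output because it is the type discussed in the rest of this ticket, and the results depend on the compiler and build flags.

```c
#include <stdio.h>
#include <sys/types.h>   /* off_t */

/* Hypothetical helper: report the sizes relevant to this ticket.
 * %zu is the printf conversion for size_t, which is what sizeof yields. */
static int print_type_sizes(void)
{
    return printf("sizeof(int): %zu, sizeof(long): %zu, sizeof(off_t): %zu\n",
                  sizeof(int), sizeof(long), sizeof(off_t));
}

/* Nonzero when fseek's long argument is at least 8 bytes wide
 * (assuming 8-bit bytes), i.e. can address files past 2 GiB. */
static int long_can_address_large_files(void)
{
    return sizeof(long) >= 8;
}
```

If long_can_address_large_files() returns 0, plain fseek cannot reach past 2 GB on that build, whatever integer type the caller uses for its own offsets.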

My attached 6-line large_dbf.v1.patch fixes my problems writing > 2GB DBF files with shp2pgsql on 32-bit Linux.

  1. Add -D_FILE_OFFSET_BITS=64 to Makefile's CFLAGS, which makes off_t a signed 8 byte integer (equivalent to the long long type).
  2. In shapefil.h, change typedef of SAOffset from unsigned long (which is 4-byte on 32-bit systems) to off_t.
  3. In safileio.c, change fseek call to fseeko, and cast offset to off_t rather than long. fseek always takes a long, and a long can't be made 8 bytes on a 32-bit system.

by dfuhry2, 12 years ago

Attachment: large_dbf.v1.patch added

comment:11 by gdt, 12 years ago

Please keep in mind that _FILE_OFFSET_BITS is nonportable; it isn't used on BSDs, where off_t is simply always int64_t. But using off_t and fseeko instead of long and fseek seems entirely sensible. So probably the -D_FILE_OFFSET_BITS=64 needs to be wrapped in a configure test to be added only on operating systems where it makes sense.

comment:12 by mcayland, 12 years ago

Given that we recently re-synced with upstream, I don't really want us to be maintaining a fork once again :( How does upstream shapelib handle this issue? I'd much rather you got the patches tested/accepted there and then we can simply re-sync.

comment:13 by dfuhry2, 12 years ago

I filed a bug in ShapeLib's bug tracker: http://bugzilla.maptools.org/show_bug.cgi?id=2359

w.r.t. _FILE_OFFSET_BITS nonportability, to be less intrusive, instead of specifying -D_FILE_OFFSET_BITS=64 in the Makefile's CFLAGS, we could just #define _FILE_OFFSET_BITS 64 at the top of pgsql2shp.c, before any #includes as described here: http://www.gnu.org/software/libc/manual/html_node/Feature-Test-Macros.html

comment:14 by pramsey, 12 years ago

Shapelib 1.3.0b has this

#ifndef SAOffset
typedef unsigned long SAOffset;
#endif

which seems reasonable enough to me. An unsigned integer gets us to 4GB, so we max out 32 bit systems. The definition of off_t at least on OSX is of a signed value, so that means we could support large files on 64bit but not on 32bit. Not good enough? As Mark says, our best bet is just to track shapelib closely.

comment:15 by pramsey, 12 years ago

Poking around shapelib, it looks like we still have some patches it does not, around date handling in particular.

comment:16 by pramsey, 12 years ago

I've updated to the latest shapelib in trunk at r8919.

I wonder, does this bug apply to trunk anymore? We should be able to handle quite large files using "unsigned long" as our offset, and work on both 32 and 64 bits. Perhaps this problem is only a 1.5 problem now. Comments?

comment:17 by dfuhry2, 12 years ago

OSX is not a problem, since off_t and unsigned long are both 64bit integers on 32 bit Macs:

$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Darwin laptop 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
#include <stdio.h>
int main(int argc, char **argv) { printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int)); }

32-bit Linux will still have the 2GB limitation against trunk, since off_t (by default, without _FILE_OFFSET_BITS=64) and unsigned long are both 32bit integers:

$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Linux host 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
#include <stdio.h>
int main(int argc, char **argv) { printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int)); }
sizeof(off_t): 4, sizeof(unsigned long): 4, sizeof(int): 4

comment:18 by dfuhry2, 12 years ago

Last line of OSX output got truncated, here it is again:

$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Darwin laptop 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
#include <stdio.h>
int main(int argc, char **argv) { printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int)); }
sizeof(off_t): 8, sizeof(unsigned long): 8, sizeof(int): 4

comment:19 by pramsey, 12 years ago

Since SAOffset is unsigned long, won't we end up with a 4GB limit on 32bit linux? That's still 2x bigger than our limit before… should we aim yet higher?

comment:20 by dfuhry2, 12 years ago

Paul, in safileio.c's SADFSeek, the SAOffset value is cast to (signed) long, making the limit 2GB on 32bit Linux.

SAOffset SADFSeek( SAFile file, SAOffset offset, int whence )
{
    return (SAOffset) fseek( (FILE *) file, (long) offset, whence );
}

by pramsey, 12 years ago

Attachment: seeko.patch added

How about this? Wonder what platforms fseeko does not exist on…

comment:21 by pramsey, 12 years ago

Resolution: fixed
Status: new → closed

OK, hearing nothing more, I've taken the original dfuhry2 patch and wrapped it in some autoconf testing for fseeko to hopefully guard platforms that don't have it.

I've also pulled our shapelib up to the b3 release, but it still has custom stuff for our date and boolean handling. I've submitted patches on those changes into the shapelib tracker, so maybe some day in the future we'll be able to track a clean shapelib.

In trunk at r8967.
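
For readers wondering what guarding fseeko behind an autoconf check can look like on the C side, here is a rough sketch. HAVE_FSEEKO is the macro that autoconf's AC_FUNC_FSEEKO defines when the function is available; loader_offset_t and loader_seek are hypothetical names, and the ticket does not show the actual contents of r8967, so this should not be read as the committed code.

```c
#include <stdio.h>
#include <sys/types.h>

/* HAVE_FSEEKO would normally come from the configure-generated config.h.
 * When it is absent, fall back to plain fseek and its long offset. */
#ifdef HAVE_FSEEKO
typedef off_t loader_offset_t;   /* hypothetical name */
#else
typedef long loader_offset_t;    /* hypothetical name */
#endif

/* Hypothetical wrapper: seek with the widest call the platform offers.
 * Returns 0 on success and -1 on failure in both branches. */
static int loader_seek(FILE *fp, loader_offset_t offset, int whence)
{
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    return fseek(fp, offset, whence);
#endif
}
```

On a platform without fseeko the fallback branch keeps the build working, at the cost of keeping the 2 GB limit wherever long is 4 bytes.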
