Changes between Version 1 and Version 2 of LargeFileSupport


Timestamp:
Sep 10, 2016, 10:19:10 PM (8 years ago)
Author:
Nikos Alexandris
Comment:

Moved from GRASS-Wiki

  • LargeFileSupport

    v1 v2  
    1 See http://grass.osgeo.org/wiki/Large_File_Support (to be moved here)
     1Note, the following is largely based on comments by Glynn Clements on the GRASS-dev mailing list.
     2
     3== The need ==
     4
     5Standard C `<stdio.h>` file functions such as `ftell()` and `fseek()` represent file offsets as a `long`. On systems where `long` is 32 bits, this overflows at 2 GiB. Supporting files bigger than this requires Large File Support (LFS). In GRASS GIS 6, LFS is only implemented in `libgis` (i.e. raster maps can be read and written, but few import/export modules or vector functions support it). The situation is much better in GRASS GIS 7.
     6
     7In GRASS, `configure.in`/`configure` have been updated to support `--enable-largefile` (based on code from "cdr-tools"). See [http://trac.osgeo.org/grass/browser/grass/trunk/configure.in configure.in] (around line 1610).
     8
     9== The issues ==
     10
     11The problem is that `ftell()` returns the result as a `(signed) long`. If the result won't fit into a long, it returns `-1` (and sets `errno` to `EOVERFLOW`).
     12
     13This can only happen if you also set `_FILE_OFFSET_BITS` to `64` so that `fopen()` is redirected to `fopen64()`, otherwise `fopen()` will simply refuse to open files larger than 2GiB (apparently, this isn't true on some versions of MacOSX, which open the file anyhow then fail on `fseek`/`ftell` once you've passed the 2GiB mark).
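The `-1`/`EOVERFLOW` behaviour described above can be detected explicitly. The following is a minimal sketch, not code from GRASS; the helper name `checked_ftell` and its out-parameters are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: wraps ftell() and reports whether a failure was
 * caused by the offset not fitting into a long (errno == EOVERFLOW).
 * Returns 0 and stores the offset in *pos on success, -1 on failure. */
int checked_ftell(FILE *fp, long *pos, int *overflow)
{
    long r;

    assert(fp != NULL && pos != NULL && overflow != NULL);

    *overflow = 0;
    errno = 0;
    r = ftell(fp);
    if (r == -1L) {
        /* Distinguish "offset too large for long" from other errors. */
        *overflow = (errno == EOVERFLOW);
        return -1;
    }
    *pos = r;
    return 0;
}
```

A caller that sees `*overflow` set knows the file itself is fine and that only the `long` return type is the limitation.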
     14
     15If you want to obtain the current offset for a file whose size exceeds the range of a `signed long`, you instead have to use the (non-ANSI) `ftello()` function, which returns the offset as an `off_t`.
     16
     17**To Do:** Before we do that, we would need to add configure checks so that we don't try to use `ftello()` on systems which don't provide it.
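As a rough sketch of the `ftello()` approach, assuming a POSIX platform that provides `fseeko()`/`ftello()` once the feature macros are defined: the helper name `stream_size` is hypothetical and not part of GRASS.

```c
/* The feature macros must precede every #include so that <stdio.h>
 * sees them; on platforms where they are not needed they are harmless. */
#define _LARGEFILE_SOURCE
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: size in bytes of an open stream, or -1 on error.
 * The current position is restored before returning. */
off_t stream_size(FILE *fp)
{
    off_t saved, size;

    assert(fp != NULL);

    saved = ftello(fp);              /* offset as off_t, not long */
    if (saved < 0)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    size = ftello(fp);
    if (fseeko(fp, saved, SEEK_SET) != 0)
        return -1;
    return size;
}
```

Because the result is an `off_t`, it stays valid past the 2 GiB mark wherever `off_t` is 64 bits wide.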
     18
     19There isn't a truly portable solution. Some platforms might not even have an integral type larger than 32 bits.
     20
     21The most practical solution is to use `ftello()` if it's available. This will require some configure checks. These are simple enough to implement; it's the design which is problematic (as usual).
     22
     23Unlike most `HAVE_FOO` checks, `fseeko()` isn't a simple have/don't-have check. Rather, it's usually the case that the function is available only when certain macros are defined (e.g. `_LARGEFILE_SOURCE`).
     24
     25That gives rise to the question of what we check for, how we check for it, how we pass that information to the code, and how we use it.
     26
     27The trick is deciding what to test, and what to indicate. Do we want to know:
     28
     291. Whether `ftello()` exists with the default `CFLAGS`/`CPPFLAGS`?
     30
     312. Whether `ftello()` exists with the default `CFLAGS`/`CPPFLAGS` plus a fixed selection of additional switches (e.g. `-D_LARGEFILE_SOURCE`)?
     32
     333. Whether `ftello()` exists with the default `CFLAGS`/`CPPFLAGS` plus a variable selection of additional switches (e.g. `-D_LARGEFILE_SOURCE`) with those switches being communicated via an additional variable?
     34
     35The test would need to be:
     36
     37{{{
     38#if defined(HAVE_FTELLO) && defined(HAVE_FSEEKO)
     39}}}
     40
     41as you are using both of those.
     42
     43Also, if the existence of `fseeko`/`ftello` is conditional upon certain macros (e.g. `_LARGEFILE_SOURCE`), you have to ensure that those macros are defined before the first inclusion of `<stdio.h>`, including any "hidden" inclusion from other headers. In practice, this means that the macros must be defined before you include any other headers (except for `<grass/config.h>`).
     44
     45
     46Rather than try to come up with some infrastructure which allows us to use LFS in a piecemeal fashion, it would be preferable to clean up the GRASS code so that we can enable LFS globally. Then, we can just add:
     47
     48{{{
     49-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64
     50}}}
     51
     52to `CPPFLAGS`, and not have to worry about adding the necessary macros to individual files. Any `HAVE_`* checks then become simple have/don't-have checks.
     53
     54
     55My inclination would be to add:
     56
     57{{{
     58extern off_t G_ftell(FILE *fp);
     59extern int G_fseek(FILE *stream, off_t offset, int whence);
     60}}}
     61
     62These would be implemented using `fseeko`/`ftello` where available, and `fseek`/`ftell` otherwise. That eliminates the need to perform checks in individual source files. However, we would need to take care not to use them in code which can't handle large offsets (i.e. code which will truncate `off_t` values to `int`/`long`).
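One way such wrappers might look is sketched below. This is an illustration, not the GRASS implementation: the names `sketch_ftell`/`sketch_fseek` are hypothetical, and `HAVE_FSEEKO` is assumed to be supplied by a configure check (for example via `<grass/config.h>`). When the macro is absent, the fallback branch compiles instead.

```c
#define _LARGEFILE_SOURCE
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical wrapper: current offset as off_t. */
off_t sketch_ftell(FILE *fp)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return ftello(fp);
#else
    return (off_t) ftell(fp);
#endif
}

/* Hypothetical wrapper: seek using an off_t offset.
 * Returns 0 on success, -1 on failure (like fseek). */
int sketch_fseek(FILE *fp, off_t offset, int whence)
{
    assert(fp != NULL);
#ifdef HAVE_FSEEKO
    return fseeko(fp, offset, whence);
#else
    {
        /* fseek() takes a long: refuse offsets that would be truncated
         * rather than silently seeking to the wrong place. */
        long loff = (long) offset;

        if ((off_t) loff != offset) {
            errno = EOVERFLOW;
            return -1;
        }
        return fseek(fp, loff, whence);
    }
#endif
}
```

Failing with an error in the fallback branch follows the reasoning above that explicit failure is preferable to truncating an `off_t` value.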
     63
     64
     65If you use `-D_FILE_OFFSET_BITS=64`, `fopen()` should be redirected to `fopen64()`. There are reports that the MacOSX `fopen()` is equivalent to `fopen64()` even without additional switches.
     66
     67== Coding LFS in GRASS GIS 6 ==
     68
     69Currently the `--enable-largefile` switch only enables LFS in `libgis`, not anywhere else.
     70
     71[Although `config.h` includes definitions to enable LFS automatically, those definitions are currently inactive. This is probably a good thing; a lot of GRASS' code isn't LFS-aware, and explicit failure is preferable to silently corrupting data.]
     72
     73To enable LFS elsewhere, you need to manually add `-D_FILE_OFFSET_BITS=64` to the compilation flags. The simplest approach is to add to the module's Makefile:
     74
     75{{{
     76ifneq ($(USE_LARGEFILES),)
     77       EXTRA_CFLAGS = -D_FILE_OFFSET_BITS=64
     78endif
     79}}}
     80
     81and include `config.h` before '''all''' other header files in the code.
     82
     83{{{
     84#include <grass/config.h>
     85#include <stdio.h>
     86#include <string.h>
     87#include <grass/gis.h>
     88 ...
     89}}}
     90
     91=== `int` versus `off_t` ===
     92
     93You may as well just use `off_t filesize` unconditionally. An `off_t` will always be large enough to hold a `long`.
     94
     95If using `off_t`, be sure to add:
     96
     97{{{
     98 #include <sys/types.h>
     99}}}
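A small sketch of working with an `off_t` file size, assuming a POSIX system with `stat()`. The helper names `path_size` and `print_size` are hypothetical. Since `printf` has no conversion for `off_t`, the value is cast to `long long` for printing.

```c
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>   /* off_t */
#include <sys/stat.h>    /* stat() */

/* Hypothetical helper: size of the named file in bytes, or -1 on error. */
off_t path_size(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return -1;
    return st.st_size;   /* st_size is already an off_t */
}

/* Hypothetical helper: print an off_t portably via long long. */
int print_size(FILE *out, off_t filesize)
{
    return fprintf(out, "%lld bytes\n", (long long) filesize);
}
```

Keeping the value in an `off_t` from `stat()` through to the final cast avoids any intermediate truncation to `int` or `long`.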
     100
     101== Issues related to import/export of maps ==
     102
     103Some export/import formats have their own intrinsic size limitations; see, for instance, http://www.gdal.org/formats_list.html
     104
     105== GRASS GIS 6: LFS-safe libs and module list ==
     106
     107* libgis
     108
     109* r.in.arc
     110* r.in.ascii
     111* r.out.arc
     112* r.out.ascii
     113* r.proj.seg
     114* r.terraflow
     115
     116== GRASS GIS 6: LFS works in progress ==
     117
     118* r.in.xyz
     119* r.terraflow (integrate current LFS support into GRASS's `--enable-largefile` `./configure` switch)[[BR]]
     120  (r.terraflow creates huge temporary files which can easily go over 2GB)
     121
     122== GRASS GIS 6: LFS wish list ==
     123
     124'''High priority modules to get LFS'''
     125
     126* r.in.*
     127* r.out.*
     128* GRASS GDAL plugin (??)
     129* v.surf.rst
     130* v.surf.idw(2)
     131* vector libs (limited by number of features)
     132* v.in.ascii -bt  (without topology)
     133* DB libs
     134
     135== Coding LFS in GRASS GIS 7 ==
     136
     137* Already enabled for raster and vector libraries and modules, see http://trac.osgeo.org/grass/wiki/Grass7/NewFeatures
     138
     139'''Q:''' Does module xyz (e.g. r.texture) support LFS?
     140
     141'''A:''' Modules typically don't need to do anything regarding LFS; the support is in the libraries. The main issue which affects modules is that they shouldn't assume that cell counts will fit into an `int` or even a `long`. But even failing to do so won't have any effect upon I/O.
     142
     143'''Details:'''
     144
     145The most common way for the issue to arise is multiplying the number of rows by the number of columns to obtain the total number of cells.
     146Most modules have no need to do this, but it is occasionally done e.g. when calculating statistics or storing the data in a temporary file.
     147
     148The number of rows and columns can reasonably be assumed to fit into a signed 32-bit integer, but their product cannot.
     149
     150Even on a system with a 64-bit `long`[1], multiplying 2 `int`s will produce an `int` result, and assigning the result to a `long` variable doesn't change that. E.g.
     151
     152{{{
     153int nrows = Rast_window_rows();
     154int ncols = Rast_window_cols();
     155long ncells = nrows * ncols;
     156}}}
     157
     158will truncate the result of the multiplication to an `int` (which is 32 bits on all mainstream platforms) then expand the truncated value to 64 bits in the assignment. To perform the multiplication using `long`, one of the arguments must be converted, e.g.:
     159
     160{{{
     161long ncells = (long) nrows * ncols;
     162}}}
     163
     164[1] This doesn't include 64-bit versions of Windows, where `long` is only 32 bits for compatibility reasons.
     165
     166The issue isn't strictly related to LFS; due to compression, it's possible for a raster with more than 2^31^ cells to take up less than 2 GiB on disk. LFS just makes the issue more likely to arise in practice.
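Because `long` is only 32 bits on 64-bit Windows, one option is to do the multiplication in `long long`, which C99 guarantees to be at least 64 bits. The sketch below shows this; the function name `count_cells` is hypothetical, and in a module the arguments would come from `Rast_window_rows()` and `Rast_window_cols()`.

```c
#include <assert.h>

/* Hypothetical helper: total number of cells in a region.
 * Converting one operand to long long before multiplying makes the
 * whole multiplication happen in (at least) 64 bits. */
long long count_cells(int nrows, int ncols)
{
    assert(nrows >= 0 && ncols >= 0);

    return (long long) nrows * ncols;
}
```

For example, a 50000 x 50000 region has 2500000000 cells, which exceeds the range of a signed 32-bit `int` but is computed correctly here.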
     167
     168== References ==
     169
     170* [http://www.suse.de/~aj/linux_lfs.html LFS support in Linux]
     171* [http://opengroup.org/platform/lfs.html Adding Large File Support to the Single UNIX® Specification]
     172* [http://www.sun.com/software/whitepapers/wp-largefiles/largefiles.pdf Large Files in Solaris: A White Paper]