Changes between Initial Version and Version 1 of FGDBSpecification


Timestamp: Oct 8, 2013, 2:00:28 PM
Author: Even Rouault

  • FGDBSpecification

    v1 v1  
     1[[PageOutline]]
     2
     3= Introduction =
     4
     5This is a work-in-progress reverse-engineered specification of .gdbtable and .gdbtablx files found in FileGDB datasets.
     6
     7= Conventions =
     8
     9ubyte: unsigned byte
     10int16: little-endian 16-bit integer
     11int32: little-endian 32-bit integer
     12float64: little-endian 64-bit IEEE754 floating point number
     13utf16: string in little-endian UTF-16 encoding
     14string: (UTF-8 ?) string
     15
     16"Row" and "feature" are used as synonyms in this document.
     17
     18= Specification of .gdbtable files =
     19
     20.gdbtable files describe fields and contain row data.
     21
     22They are made of a header, a section describing the fields, and a section describing the rows.
     23
     24== Header (40 bytes) ==
     25
     26||'''Format'''||'''Content'''||
     27|| 4 bytes || 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
     28|| int32 || number of (valid) rows ||
     29|| 4 bytes || varying values - unknown role ||
     30|| 4 bytes || 0x05 0x00 0x00 0x00 - unknown role. Constant among the files ||
     31|| 4 bytes || varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions ||
     32|| 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
     33|| int32 || file size in bytes ||
     34|| 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
     35|| int32 || offset in bytes at which the field description section begins, often (in FGDB 10) 0x28 0x00 0x00 0x00, i.e. 40 ||
     36|| 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
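The header table above can be read in one pass. The following is a minimal sketch, assuming the ten 4-byte slots are laid out exactly as listed; the function name and the dictionary keys are illustrative and not part of the format. The slots of unknown role are skipped.

```python
import struct

# Ten little-endian 4-byte slots, in the order of the header table.
# "4s" marks a slot whose role is unknown, "i" an int32.
HEADER_FORMAT = "<4si4s4s4s4si4si4s"


def parse_gdbtable_header(data):
    """Return the three documented values of a 40-byte .gdbtable header."""
    if len(data) < 40:
        raise ValueError("a .gdbtable header needs at least 40 bytes")
    slots = struct.unpack_from(HEADER_FORMAT, data, 0)
    return {
        "valid_rows": slots[1],      # number of (valid) rows
        "file_size": slots[6],       # file size in bytes
        "fields_offset": slots[8],   # start of the field description section
    }
```

Typical use would be `parse_gdbtable_header(open(path, "rb").read(40))`, where `path` is a hypothetical .gdbtable file name.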
     37
     38== Field description section ==
     39
     40=== Fixed part ===
     41
     42||'''Format'''||'''Content'''||
     43|| int32 || size of header in bytes (this field excluded) ||
     44|| int32 || version of the file ? Seems to be 3 for FGDB 9.X files and 4 for FGDB 10.X files ||
     45|| ubyte || layer geometry type. 1 = point, 2 = multipoint, 3 = (multi)polyline, 4 = (multi)polygon ||
     46|| 3 bytes || 0x03 0x00 0x00 - unknown role ||
     47|| int16 || number of fields (including geometry field and implicit OBJECTID field) ||
     48
     49=== Repeated part (per field) ===
     50
     51The field descriptions follow immediately (repeated as many times as the number of fields).
     52
     53||'''Format'''||'''Content'''||
     54|| ubyte || number of UTF-16 characters (not bytes) of the name of the field ||
     55|| utf16 || name of the field ||
     56|| ubyte || number of UTF-16 characters (not bytes) of the alias of the field. Might be 0 ||
     57|| utf16 || alias of the field (omitted if previous field is 0) ||
     58|| ubyte || 0x00 ||
     59|| ubyte || field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 10 = ?, 11 = UUID, 12 = ?  ) ||
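As an illustration of the per-field table above, here is a minimal sketch that reads the name, alias and type code of one field. It assumes that each UTF-16 character occupies 2 bytes; the function name is hypothetical. The type-dependent bytes that follow are not handled.

```python
def read_field_prefix(buf, offset):
    """Read name, alias and type code of one field.

    Returns (name, alias, field_type, next_offset).
    """
    name_chars = buf[offset]            # count of UTF-16 characters, not bytes
    offset += 1
    name = buf[offset:offset + 2 * name_chars].decode("utf-16-le")
    offset += 2 * name_chars

    alias_chars = buf[offset]           # might be 0, in which case no alias follows
    offset += 1
    alias = buf[offset:offset + 2 * alias_chars].decode("utf-16-le")
    offset += 2 * alias_chars

    offset += 1                         # the 0x00 byte
    field_type = buf[offset]            # e.g. 4 = string, 7 = geometry
    offset += 1
    return name, alias, field_type, offset
```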
     60
     61The next bytes for the field description depend on the field type.
     62
     63For field type = 4 (string),
     64
     65||'''Format'''||'''Content'''||
     66|| int32 || maximum length of string ||
     67|| ubyte || unknown role ||
     68|| ubyte || unknown role ||
     69
     70For field type = 6 (objectid),
     71
     72||'''Format'''||'''Content'''||
     73|| ubyte || unknown role = 4 ||
     74|| ubyte || unknown role = 2 ||
     75
     76For field type = 7 (geometry),
     77
     78||'''Format'''||'''Content'''||
     79|| ubyte || unknown role = 0 ||
     80|| ubyte || unknown role = 6 or 7 ||
     81|| int16 || length (in bytes) of the WKT string describing the SRS. ||
     82|| string || WKT string describing the SRS, or "{B286C06B-0879-11D2-AACA-00C04FA33C20}" if there is no SRS ||
     83|| ubyte || "magic" (used below to decide whether the M values are present). Value is generally 5 or 7 ||
     84|| float64 || xorigin ||
     85|| float64 || yorigin ||
     86|| float64 || xyscale ||
     87|| float64 || zorigin ||
     88|| float64 || zscale ||
     89|| float64 || morigin (omitted if magic = 5) ||
     90|| float64 || mscale (omitted if magic = 5) ||
     91|| float64 || xytolerance ||
     92|| float64 || ztolerance ||
     93|| float64 || mtolerance (omitted if magic = 5) ||
     94|| float64 || xmin of layer extent ||
     95|| float64 || ymin of layer extent ||
     96|| float64 || xmax of layer extent ||
     97|| float64 || ymax of layer extent ||
     98
     99The layout of the following bytes is irregular and seems to follow this algorithm:
     1001) Store the current offset
     1012) Skip one byte
     1023) Read the int32 value "magic2".
     103    a) if magic2 = 0, rewind to the stored offset, read 2 float64 values (which happen to be NaN), then go to 2)
     104    b) otherwise (generally magic2 = 1 or magic2 = 3), skip magic2 x float64 values
     105
     106For field type = 8 (binary),
     107
     108||'''Format'''||'''Content'''||
     109|| ubyte || unknown role ||
     110|| ubyte || unknown role ||
     111
     112For field type = 10, 11, 12,
     113
     114||'''Format'''||'''Content'''||
     115|| ubyte || width : 38 ||
     116|| ubyte || unknown role ||
     117
     118
     119For other field types,
     120
     121||'''Format'''||'''Content'''||
     122|| ubyte || width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime) ||
     123|| ubyte || unknown role ||
     124|| ubyte || unknown role ||
     125
     126== Rows section ==
     127
     128The rows section does not necessarily follow the last field description immediately. It generally starts a few bytes later,
     129but not at a predictable offset. Note: for FGDB layers created by the ESRI FGDB SDK API, there are 4 bytes between the end of the field description
     130section and the beginning of the rows section: 0xDE 0xAD 0xBE 0xEF (!)
     131
     132The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which might
     133be different from the number of valid rows found in the header of the .gdbtable). Each row starts at an offset
     134indicated in the .gdbtablx file.
     135
     136== Row description ==
     137
     138||'''Format'''||'''Content'''||
     139|| int32 || length in bytes of the row blob (this field excluded) ||
     140|| ceil(number_fields / 8) * ubyte || flags describing if a field is null. See the explanation below ||
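The start of a row can be read as follows. This is a minimal sketch of the table above; the function name is illustrative, and `number_fields` is expected to be the field count used for the flags, as described in the null flags section. The meaning of the individual flag bits is left out.

```python
import struct


def read_row_header(buf, offset, number_fields):
    """Read the blob length and the null-flag bytes of a row.

    Returns (blob_length, flags, offset_of_first_field_content).
    """
    (blob_length,) = struct.unpack_from("<i", buf, offset)
    flag_bytes = (number_fields + 7) // 8   # integer form of ceil(number_fields / 8)
    flags_start = offset + 4
    flags = buf[flags_start:flags_start + flag_bytes]
    return blob_length, flags, flags_start + flag_bytes
```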
     141
     142=== Null fields flags ===
     143
     144Each bit of the flags field encodes the presence or absence of the field content for the row.
     145The flag is set to 1 if the field is missing/null, or 0 if the field is present/non-null (0 is used as well for spare bytes).
     146The flag for the first field, in the order of the fields of the field description section (typically the geometry),
     147is the least significant bit of the last byte of the flags field.
     148
     149Note: there's no explicit data for OBJECTID and no flag bit for it. It must be ignored when considering
     150the list of fields (for number_fields value in particular).
     151
     152For each non-null field, the field content is appended in the order of the fields of the field
     153description section.
     154
     155=== Field content ===
     156
     157==== Geometry field ====
     158
     159This field is generally called "SHAPE".
     160
     161Geometry blobs use 2 new encoding schemes :
     162  * varuint : a sequence of bytes [b0, b1, ... bN]. All bytes except the last one have their msb (most significant bit) set to 1. A byte with msb = 0 marks the end of the sequence. The value of the varuint is (b0 & 0x7F) | ((b1 & 0x7F) << 7) | ((b2 & 0x7F) << 14) | ... | ((bN & 0x7F) << (7 * N)). Note that a valid sequence might be just 1 byte.
     163  * varint : same concept as varuint, but the 2nd most significant bit of b0 (i.e. the one obtained by masking with 0x40) indicates the sign of the result and must be ignored when computing the unsigned value. If the sign bit is set to 1, the value must be negated.
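The two encodings can be sketched as follows. The varuint decoder follows the formula given above. For varint, the text only says that the sign bit is ignored; this sketch assumes that b0 then contributes 6 value bits and that the following bytes are shifted by 6, 13, 20, ... bits. That shift is an assumption, not something stated above. Function names are illustrative.

```python
def read_varuint(buf, offset):
    """Decode a varuint; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        b = buf[offset]
        offset += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if (b & 0x80) == 0:         # msb = 0 marks the last byte
            return value, offset


def read_varint(buf, offset):
    """Decode a varint; return (value, next_offset)."""
    b = buf[offset]
    offset += 1
    negative = (b & 0x40) != 0      # sign bit of b0
    value = b & 0x3F                # remaining 6 value bits of b0
    shift = 6                       # assumption: later bytes start at bit 6
    while (b & 0x80) != 0:
        b = buf[offset]
        offset += 1
        value |= (b & 0x7F) << shift
        shift += 7
    return (-value if negative else value), offset
```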
     164
     165||'''Format'''||'''Content'''||
     166|| varuint || length of the geometry blob in bytes (this field excluded) ||
     167|| ubyte || geometry_type. 1 = 2D point, 3 = 2D (multi)linestring, 5 = 2D (multi)polygon. Other values possible. See SHPT_ enumeration of [http://svn.osgeo.org/gdal/trunk/gdal/ogr/ogrpgeogeometry.h ogrpgeogeometry.h] ||
     168
     169===== For point geometries (geometry type = 1, 9, 21, 11) =====
     170||'''Format'''||'''Content'''||
     171|| varuint || x = (varuint + xorigin * xyscale) / xyscale ||
     172|| varuint || y = (varuint + yorigin * xyscale) / xyscale ||
     173|| varuint ( present only if Z component ) || z = (varuint + zorigin * zscale) / zscale ||
     174|| varuint ( present only if M component ) || m = (varuint + morigin * mscale) / mscale ||
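As a worked example of the formulas above, the following sketch decodes a 2D point blob (geometry type 1). The function names are hypothetical; `xorigin`, `yorigin` and `xyscale` are the values from the geometry field description. Z and M components are not handled.

```python
def read_varuint(buf, offset):
    """Decode a varuint; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        b = buf[offset]
        offset += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if (b & 0x80) == 0:
            return value, offset


def read_point_2d(blob, xorigin, yorigin, xyscale):
    """Decode a 2D point blob; return (geometry_type, x, y)."""
    _length, pos = read_varuint(blob, 0)    # blob length, this field excluded
    geometry_type = blob[pos]
    pos += 1
    raw_x, pos = read_varuint(blob, pos)
    raw_y, pos = read_varuint(blob, pos)
    x = (raw_x + xorigin * xyscale) / xyscale
    y = (raw_y + yorigin * xyscale) / xyscale
    return geometry_type, x, y
```

With xorigin = yorigin = -400 and xyscale = 1000, a stored x of 500000 gives (500000 - 400000) / 1000 = 100.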
     175
     176===== For multipoint geometries (geometry type = 8, 20, 28, 18) =====
     177||'''Format'''||'''Content'''||
     178|| varuint || number of points ||
     179
     180followed by points coordinates:
     181
     182  * First point (i = 0):
     183||'''Format'''||'''Content'''||
     184|| varuint || x[0] = (varuint + xorigin * xyscale) / xyscale ||
     185|| varuint || y[0] = (varuint + yorigin * xyscale) / xyscale ||
     186|| varuint ( present only if Z component ) || z[0] = (varuint + zorigin * zscale) / zscale ||
     187|| varuint ( present only if M component ) || m[0] = (varuint + morigin * mscale) / mscale ||
     188
     189  * For each next point (i > 0) (with dx = dy = dz = dm = 0 at initialization):
     190||'''Format'''||'''Content'''||
     191|| varint || dx = dx + varint. x[i] = x[0] + dx / xyscale ||
        || varint || dy = dy + varint. y[i] = y[0] + dy / xyscale ||
     192|| varint ( present only if Z component ) || dz = dz + varint. z[i] = z[0] + dz / zscale ||
     193|| varint ( present only if M component ) || dm = dm + varint. m[i] = m[0] + dm / mscale ||
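The delta scheme for the points after the first one can be illustrated for a single axis. This is a minimal sketch with a hypothetical function name: `first` is the already decoded coordinate of point 0, `deltas` are the decoded varint values of the following points for that axis, and `scale` is xyscale, zscale or mscale as appropriate.

```python
def accumulate_axis(first, deltas, scale):
    """Rebuild the coordinates of one axis from cumulative deltas."""
    values = [first]
    running = 0                     # dx (or dz, dm), 0 at initialization
    for delta in deltas:
        running += delta            # dx = dx + varint
        values.append(first + running / scale)   # x[i] = x[0] + dx / xyscale
    return values
```

Note that each coordinate is expressed relative to the first point, with the deltas summed up as integers before the division by the scale.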
     194
     195===== For (multi)linestring (geometry type = 3, 10, 23, 13) or (multi)polygon (geometry type = 5, 19, 25, 15) =====
     196
     197||'''Format'''||'''Content'''||
     198|| varuint || total number of points of all following parts ||
     199|| varuint || number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings of a multilinestring, or 1 for a linestring ||
     200|| varuint || number of points of first part ||
     201|| ... || ... ||
     202|| varuint || number of points of the (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points) ||
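The part sizes can be recovered as follows. This is a minimal sketch of the table above with illustrative function names; the offset is expected to point just after the geometry_type byte.

```python
def read_varuint(buf, offset):
    """Decode a varuint; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        b = buf[offset]
        offset += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if (b & 0x80) == 0:
            return value, offset


def read_part_sizes(buf, offset):
    """Return (list of point counts per part, next_offset)."""
    total_points, offset = read_varuint(buf, offset)
    part_count, offset = read_varuint(buf, offset)
    if part_count == 0:
        return [], offset
    sizes = []
    for _ in range(part_count - 1):         # the last part is not stored
        count, offset = read_varuint(buf, offset)
        sizes.append(count)
    sizes.append(total_points - sum(sizes)) # size of the last part
    return sizes, offset
```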
     203
     204followed by, for each part, points coordinates:
     205
     206  * First point of first part :
     207||'''Format'''||'''Content'''||
     208|| varuint || x[0] = (varuint + xorigin * xyscale) / xyscale ||
     209|| varuint || y[0] = (varuint + yorigin * xyscale) / xyscale ||
     210|| varuint ( present only if Z component ) || z[0] = (varuint + zorigin * zscale) / zscale ||
     211|| varuint ( present only if M component ) || m[0] = (varuint + morigin * mscale) / mscale ||
     212
     213  * For each next point (other points of the first part, or for all points of the following parts) :
     214||'''Format'''||'''Content'''||
     215|| varint || dx = dx + varint. x[i] = x[0] + dx / xyscale ||
        || varint || dy = dy + varint. y[i] = y[0] + dy / xyscale ||
     216|| varint ( present only if Z component ) || dz = dz + varint. z[i] = z[0] + dz / zscale ||
     217|| varint ( present only if M component ) || dm = dm + varint. m[i] = m[0] + dm / mscale ||
     218
     219==== String ====
     220
     221Number of bytes of the string, followed by string content
     222
     223==== Other types ====
     224
     225An int16 value for an int16 field, an int32 value for an int32 field, etc.
     226
     227Note: datetime values are the number of days since 30 Dec 1899 00:00:00, encoded as float64
     228
     229
     230= Specification of .gdbtablx files =
     231
     232.gdbtablx files contain the offset of the rows of the associated .gdbtable file.
     233
     234== Header (16 bytes) ==
     235
     236||'''Format'''||'''Content'''||
     237|| 4 bytes || 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
     238|| 4 bytes || 0x01 0x00 0x00 0x00 (for GDB 10?), 0x03 0x00 0x00 0x00 (for GDB 9?) - unknown role.  ||
     239|| int32 || number of rows, including deleted rows ||
     240|| 4 bytes || 0x05 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
     241
     242== Offset section ==
     243
     244The section starts immediately after the header (at offset 16) and is made of 5 x number_rows bytes. For each row:
     245
     246||'''Format'''||'''Content'''||
     247|| int32 || offset of the beginning of the row in the .gdbtable file, or 0 if the row is deleted ||
     248|| ubyte || always 0 - unknown role ||
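Reading the row offsets can be sketched as follows, assuming the 16-byte header and the 5-byte records described above. The function name is illustrative, and deleted rows (offset 0) are reported as `None`, which is a choice of this sketch.

```python
import struct


def read_row_offsets(data):
    """Return one .gdbtable offset per row, or None for a deleted row."""
    (number_rows,) = struct.unpack_from("<i", data, 8)   # third header slot
    offsets = []
    for i in range(number_rows):
        # 5 bytes per row: an int32 offset, then one byte of unknown role
        (row_offset,) = struct.unpack_from("<i", data, 16 + 5 * i)
        offsets.append(row_offset if row_offset != 0 else None)
    return offsets
```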
     249
     250== Padding section ==
     251
     252A long run of bytes set to 0.
     253
     254== Trailing section ==
     255
     256The last few bytes look like 00 00 00 00 X 00 00 00 X 00 00 00 00 00 00 00 where X is non-zero (often 1). Unknown role.