Changes between Version 15 and Version 16 of FGDBSpecification


Timestamp:
Oct 11, 2013, 2:21:40 PM (11 years ago)
Author:
Even Rouault
Comment:

Specification moved

  • FGDBSpecification

    v15 v16  
    33= Introduction =
    44
    5 This is a work-in-progress reverse-engineered specification of .gdbtable and .gdbtablx files found in FileGDB datasets.
    6 It applies to FileGDB datasets v10, as well as earlier versions.
    7 
    8 = Conventions =
    9 
    10  * ubyte: unsigned byte
    11  * int16: little-endian 16-bit integer
    12  * int32: little-endian 32-bit integer
    13  * float64: little-endian 64-bit IEEE754 floating point number
    14  * utf16: string in little-endian UTF-16 encoding
    15  * string: string, presumably UTF-8 encoded
    16 
    17 The terms "row" and "feature" are synonyms in this document.
    18 
    19 = Specification of .gdbtable files =
    20 
    21 .gdbtable files describe fields and contain row data.
    22 
    23 They are made of a header, a section describing the fields, and a section describing the rows.
    24 
    25 == Header (40 bytes) ==
    26 
    27 ||'''Format'''||'''Content'''||
    28 || 4 bytes || 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
    29 || int32 || number of (valid) rows ||
    30 || 4 bytes || varying values - unknown role ||
    31 || 4 bytes || 0x05 0x00 0x00 0x00 - unknown role. Constant among the files ||
    32 || 4 bytes || varying values - unknown role. Seems to be 0x00 0x00 0x00 0x00 for FGDB 10 files, but not for earlier versions ||
    33 || 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
    34 || int32 || file size in bytes ||
    35 || 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
    36 || int32 || offset in bytes at which the field description section begins (often 40 in FGDB 10) ||
    37 || 4 bytes || 0x00 0x00 0x00 0x00 - unknown role. Constant among the files ||
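
The header table above can be read as ten consecutive little-endian 32-bit values. The following is a minimal sketch under that assumption; the function name and the dictionary keys are hypothetical, and the unknown fields are simply kept as raw integers.

```python
import struct

def parse_gdbtable_header(buf):
    """Decode the 40-byte .gdbtable header described in the table above."""
    if len(buf) < 40:
        raise ValueError("a .gdbtable header needs 40 bytes")
    # Ten little-endian int32 values, in the order of the table rows.
    v = struct.unpack("<10i", buf[:40])
    return {
        "signature": v[0],       # expected to be 3
        "valid_rows": v[1],      # number of (valid) rows
        "file_size": v[6],       # file size in bytes
        "fields_offset": v[8],   # start of the field description section
        # Values whose role is unknown, kept untouched.
        "unknown": (v[2], v[3], v[4], v[5], v[7], v[9]),
    }
```

A caller would typically pass the first 40 bytes of the file and then seek to `fields_offset`.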
    38 
    39 == Field description section ==
    40 
    41 === Fixed part ===
    42 
    43 ||'''Format'''||'''Content'''||
    44 || int32 || size of header in bytes (this field excluded) ||
    45 || int32 || version of the file ? Seems to be 3 for FGDB 9.X files and 4 for FGDB 10.X files ||
    46 || ubyte || layer geometry type. 1 = point, 2 = multipoint, 3 = (multi)polyline, 4 = (multi)polygon ||
    47 || 3 bytes || 0x03 0x00 0x00 - unknown role ||
    48 || int16 || number of fields (including geometry field and implicit OBJECTID field) ||
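
As an illustration, the fixed part above occupies 14 bytes (4 + 4 + 1 + 3 + 2) and can be unpacked in one call. This is only a sketch; the function name and key names are hypothetical.

```python
import struct

# int32, int32, ubyte, 3 raw bytes, int16, all little-endian and unpadded.
FIXED_PART = struct.Struct("<iiB3sh")

def parse_fields_fixed_part(buf, pos):
    """Return (info, new_pos) for the fixed part starting at offset pos."""
    size, version, geom_type, unknown, n_fields = FIXED_PART.unpack_from(buf, pos)
    info = {
        "section_size": size,        # size in bytes, this field excluded
        "version": version,          # seems to be 3 for 9.X, 4 for 10.X
        "geometry_type": geom_type,  # 1 = point, 2 = multipoint, ...
        "unknown": unknown,          # observed as 0x03 0x00 0x00
        "n_fields": n_fields,        # includes geometry and OBJECTID fields
    }
    return info, pos + FIXED_PART.size
```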
    49 
    50 === Repeated part (per field) ===
    51 
    52 Following immediately: the description of the fields (repeated as many times as the number of fields)
    53 
    54 ||'''Format'''||'''Content'''||
    55 || ubyte || number of UTF-16 characters (not bytes) of the name of the field ||
    56 || utf16 || name of the field ||
    57 || ubyte || number of UTF-16 characters (not bytes) of the alias of the field. Might be 0 ||
    58 || utf16 || alias of the field (omitted if previous field is 0) ||
    59 || ubyte || field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 10/11 = UUID, 12 = XML ) ||
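
The name and the alias share the same layout: a one-byte character count followed by that many UTF-16 code units. A small sketch of reading such a counted string follows; the helper name is hypothetical, and it assumes the count is in 2-byte code units as stated in the table.

```python
def read_counted_utf16(buf, pos):
    """Read a ubyte character count followed by little-endian UTF-16 text.

    Returns (text, new_pos). A count of 0 yields an empty string and
    consumes only the count byte, which matches the optional alias.
    """
    n_chars = buf[pos]
    start = pos + 1
    end = start + 2 * n_chars  # 2 bytes per UTF-16 code unit
    return buf[start:end].decode("utf-16-le"), end
```

Calling it twice in a row reads the name and then the alias; the byte at the returned position is the field type.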
    60 
    61 The next bytes for the field description depend on the field type.
    62 
    63 For field type = 4 (string),
    64 
    65 ||'''Format'''||'''Content'''||
    66 || int32 || maximum length of string ||
    67 || ubyte || flag ||
    68 || ubyte || unknown role ||
    69 
    70 For field type = 6 (objectid),
    71 
    72 ||'''Format'''||'''Content'''||
    73 || ubyte || unknown role = 4 ||
    74 || ubyte || unknown role = 2 ||
    75 
    76 For field type = 7 (geometry),
    77 
    78 ||'''Format'''||'''Content'''||
    79 || ubyte || unknown role = 0 ||
    80 || ubyte || unknown role = 6 or 7 ||
    81 || int16 || length (in bytes) of the WKT string describing the SRS. ||
    82 || string || WKT string describing the SRS, or "{B286C06B-0879-11D2-AACA-00C04FA33C20}" for no SRS. ||
    83 || ubyte || "magic" (used after). Value is generally 5 or 7 (or 1 in system tables) ||
    84 || float64 || xorigin ||
    85 || float64 || yorigin ||
    86 || float64 || xyscale ||
    87 || float64 || zorigin (omitted if magic = 1) ||
    88 || float64 || zscale (omitted if magic = 1)  ||
    89 || float64 || morigin (omitted if magic = 1 or 5) ||
    90 || float64 || mscale (omitted if magic = 1 or 5) ||
    91 || float64 || xytolerance ||
    92 || float64 || ztolerance (omitted if magic = 1)  ||
    93 || float64 || mtolerance (omitted if magic = 1 or 5) ||
    94 || float64 || xmin of layer extent (omitted if magic = 1)  ||
    95 || float64 || ymin of layer extent (omitted if magic = 1)  ||
    96 || float64 || xmax of layer extent (omitted if magic = 1)  ||
    97 || float64 || ymax of layer extent (omitted if magic = 1)  ||
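
Which float64 values follow the "magic" byte depends on its value. The sketch below lists them in table order for a given magic, as one possible reading of the "omitted if" notes above; the function name is hypothetical and only the values 1, 5 and 7 mentioned in the table are considered.

```python
def geometry_float_names(magic):
    """Names of the float64 values that follow the magic byte, in order."""
    has_z = magic != 1            # Z values are omitted if magic = 1
    has_m = magic not in (1, 5)   # M values are omitted if magic = 1 or 5
    has_extent = magic != 1       # layer extent is omitted if magic = 1
    names = ["xorigin", "yorigin", "xyscale"]
    if has_z:
        names += ["zorigin", "zscale"]
    if has_m:
        names += ["morigin", "mscale"]
    names.append("xytolerance")
    if has_z:
        names.append("ztolerance")
    if has_m:
        names.append("mtolerance")
    if has_extent:
        names += ["xmin", "ymin", "xmax", "ymax"]
    return names
```

A reader could then unpack `len(names)` little-endian float64 values and zip them with the names.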
    98 
    99 If magic > 1, there are extra bytes whose organization seems to follow this algorithm:
    100   1. Store current offset
    101   2. Skip one byte
    102   3. Read int32 value "magic2".
    103       a. if magic2 = 0, then rewind to the stored offset and read 2 float64 (that happen to be NaN values). And then go to 2)
    104       b. otherwise (generally magic2 = 1 or magic2 = 3), skip magic2 x float64 values
    105 
    106 
    107 
    108 For field type = 8 (binary),
    109 
    110 ||'''Format'''||'''Content'''||
    111 || ubyte || unknown role ||
    112 || ubyte || unknown role ||
    113 
    114 For field type = 10, 11
    115 
    116 ||'''Format'''||'''Content'''||
    117 || ubyte || width : 38 ||
    118 || ubyte || flag ||
    119 
    120 For field type = 12
    121 
    122 ||'''Format'''||'''Content'''||
    123 || ubyte || width : 0 ||
    124 || ubyte || flag ||
    125 
    126 For other field types,
    127 
    128 ||'''Format'''||'''Content'''||
    129 || ubyte || width in bytes (e.g. 2 for int16, 4 for int32, 4 for float32, 8 for float64, 8 for datetime) ||
    130 || ubyte || flag ||
    131 || ubyte || unknown role ||
    132 
    133 If the lsb of the flag field (when present) is set to 1, then the field can be null.
    134 
    135 FIXME: find which byte is the flag field for geometry fields. They are assumed to be nullable for now.
    136 
    137 == Rows section ==
    138 
    139 The rows section does not necessarily immediately follow the last field description. It generally starts a few bytes later,
    140 but not at a predictable offset. Note: for FGDB layers created by the ESRI FGDB SDK API, there are 4 bytes between the end of the field description
    141 section and the beginning of the rows section: 0xDE 0xAD 0xBE 0xEF (!)
    142 
    143 The rows section is a sequence of X rows (where X is the total number of features found in the .gdbtablx, which might
    144 be different from the number of valid rows found in the header of the .gdbtable). Each row starts at an offset
    145 indicated in the .gdbtablx file.
    146 
    147 == Row description ==
    148 
    149 ||'''Format'''||'''Content'''||
    150 || int32 || length in bytes of the row blob (this field excluded) ||
    151 || ceil(number_nullable_fields / 8) * ubyte || flags describing if a field is null. See the explanation below ||
    152 
    153 === Null fields flags ===
    154 
    155 Each bit of the flags field encodes the presence or absence of the content of one nullable field in the row.
    156 The flag is set to 1 if the field is missing/null, or 0 if the field is present/non-null (0 is used as well for spare bytes).
    157 The flag for the first field, in the order of the fields of the field description section (typically the geometry),
    158 is the least significant bit of the first byte of the flags field.
    159 
    160 There are no bits reserved for non-nullable fields.
    161 
    162 If all fields are non-nullable, the flag field is absent.
    163 
    164 Note: there's no explicit data for OBJECTID and no reserved flag bit for it.
    165 
    166 For each non-null field, the field content is appended in the order of the fields of the field
    167 description section.
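
The rules above can be sketched as follows. The function name is hypothetical, and the code assumes that the bits for later nullable fields follow in increasing bit order, continuing into the next byte after 8 fields; the text only states the position of the first flag.

```python
def decode_null_flags(row_payload, nullable):
    """Return (is_null, n_flag_bytes).

    row_payload: bytes of the row blob, after the int32 length.
    nullable: one bool per field that has a flag candidate, in field order
              (the OBJECTID field has no flag bit and should be left out
              or passed as False).
    """
    n_nullable = sum(1 for n in nullable if n)
    n_bytes = (n_nullable + 7) // 8  # ceil(n_nullable / 8); 0 if none
    is_null = []
    bit = 0
    for can_be_null in nullable:
        if can_be_null:
            byte = row_payload[bit // 8]
            is_null.append(bool(byte & (1 << (bit % 8))))
            bit += 1
        else:
            # No bit is reserved for non-nullable fields.
            is_null.append(False)
    return is_null, n_bytes
```

The field contents start at `row_payload[n_flag_bytes:]` and only fields whose entry is False are present.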
    168 
    169 === Field content ===
    170 
    171 ==== Geometry field (type = 7) ====
    172 
    173 This field is generally called "SHAPE".
    174 
    175 Geometry blobs use 2 new encoding schemes :
    176   * varuint (64 bit): a sequence of bytes [b0, b1, ... bN]. All bytes except last one have their msb (most significant bit) set to 1. The presence of a msb = 0 marks the end of the sequence. The value of the varuint is (b0 & 0x7F) | ((b1 & 0x7F) << 7) | ((b2 & 0x7F) << 14) | ... | ((bN & 0x7F) << (7 * N)). Note that a valid sequence might be just 1 byte.
    177   * varint (64 bit): same concept as varuint, but the 2nd most significant bit of b0 (i.e. the one obtained by masking with 0x40) indicates the sign of the result and must be ignored when computing the unsigned value: (b0 & 0x3F) | ((b1 & 0x7F) << 6) | ((b2 & 0x7F) << 13) | ... | ((bN & 0x7F) << (7 * N - 1)). If the sign bit is set to 1, the value must be negated.
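
A minimal sketch of both decoders, following the two formulas above. The function names are hypothetical, and no check is made that the result fits in 64 bits.

```python
def read_varuint(buf, pos):
    """Decode a varuint starting at buf[pos]. Returns (value, new_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):  # msb = 0 marks the last byte
            return value, pos

def read_varint(buf, pos):
    """Decode a varint starting at buf[pos]. Returns (value, new_pos)."""
    b = buf[pos]
    pos += 1
    negative = bool(b & 0x40)  # sign bit of the first byte
    value = b & 0x3F           # only 6 payload bits in the first byte
    shift = 6
    while b & 0x80:            # msb = 1 means another byte follows
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        shift += 7
    return (-value if negative else value), pos
```

For example, the two bytes 0xAC 0x02 decode as the varuint 300, and the single byte 0x41 decodes as the varint -1.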
    178 
    179 ===== Common preamble to all geometry types =====
    180 
    181 ||'''Format'''||'''Content'''||
    182 || varuint || length of the geometry blob in bytes (this field excluded) ||
    183 || ubyte || geometry_type. 1 = 2D point, 3 = 2D (multi)linestring, 5 = 2D (multi)polygon. Other values are possible. See the SHPT_ enumeration in [http://svn.osgeo.org/gdal/trunk/gdal/ogr/ogrpgeogeometry.h ogrpgeogeometry.h] ||
    184 
    185 The bytes of the geometry blob following this preamble depend of course on the geometry type.
    186 
    187 ===== For point geometries (geometry type = 1, 9, 21, 11) =====
    188 ||'''Format'''||'''Content'''||
    189 || varuint || x = (varuint + xorigin * xyscale) / xyscale ||
    190 || varuint || y = (varuint + yorigin * xyscale) / xyscale ||
    191 || varuint ( present only if Z component ) || z = (varuint + zorigin * zscale) / zscale ||
    192 || varuint ( present only if M component ) || m = (varuint + morigin * mscale) / mscale ||
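
The four rows above apply the same formula with a different origin and scale. As a sketch, taking the formula exactly as written in the table (the function name is hypothetical):

```python
def dequantize(raw, origin, scale):
    """Turn a decoded integer into a coordinate: (raw + origin * scale) / scale."""
    return (raw + origin * scale) / scale
```

x and y use xorigin or yorigin with xyscale, z uses zorigin with zscale, and m uses morigin with mscale, all taken from the geometry field description.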
    193 
    194 ===== For multipoint geometries (geometry type = 8, 20, 28, 18) =====
    195 ||'''Format'''||'''Content'''||
    196 || varuint || number of points ||
    197 
    198 followed by points coordinates:
    199 
    200   * First point (i = 0):
    201 ||'''Format'''||'''Content'''||
    202 || varuint || x[0] = (varuint + xorigin * xyscale) / xyscale ||
    203 || varuint || y[0] = (varuint + yorigin * xyscale) / xyscale ||
    204 || varuint ( present only if Z component ) || z[0] = (varuint + zorigin * zscale) / zscale ||
    205 || varuint ( present only if M component ) || m[0] = (varuint + morigin * mscale) / mscale ||
    206 
    207   * For each next point (i > 0) (with dx = dy = dz = dm = 0 at initialization):
    208 ||'''Format'''||'''Content'''||
    209 || varint || dx = dx + varint. x[i] = x[0] + dx / xyscale ||
        || varint || dy = dy + varint. y[i] = y[0] + dy / xyscale ||
    210 || varint ( present only if Z component ) || dz = dz + varint. z[i] = z[0] + dz / zscale ||
    211 || varint ( present only if M component ) || dm = dm + varint. m[i] = m[0] + dm / mscale ||
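
Each following point is thus stored as a delta that is added to a running sum, and the sum is then scaled and offset by the first point. A sketch for a single axis, with a hypothetical function name and already-decoded varint deltas as input:

```python
def accumulate_axis(first, deltas, scale):
    """Rebuild the coordinates of one axis from the first value and the deltas.

    first:  coordinate of point 0 (already dequantized)
    deltas: decoded varint values for points 1..n-1, in order
    scale:  xyscale, zscale or mscale depending on the axis
    """
    coords = [first]
    running = 0  # dx, dy, dz or dm, initialised to 0
    for d in deltas:
        running += d
        coords.append(first + running / scale)
    return coords
```

In the blob itself the deltas of the different axes are interleaved point by point, so a real reader would keep one running sum per axis.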
    212 
    213 ===== For (multi)linestring (geometry type = 3, 10, 23, 13) or (multi)polygon (geometry type = 5, 19, 25, 15) =====
    214 
    215 ||'''Format'''||'''Content'''||
    216 || varuint || total number of points of all following parts ||
    217 || varuint || number of parts, i.e. number of rings for a (multi)polygon (inner and outer rings being at the same level), number of linestrings of a multilinestring, or 1 for a linestring ||
    218 || varuint || number of points of first part (omitted if there is only one part) ||
    219 || ... || ... ||
    220 || varuint || number of points of (number of parts - 1)th part (the number of points of the last part can be computed by subtracting the sum of the above numbers from the total number of points) ||
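
Since the size of the last part is not stored, it has to be derived from the total. A small sketch of that step (the function name is hypothetical):

```python
def part_sizes(total_points, explicit_counts):
    """Return the number of points of every part, including the last one.

    explicit_counts holds the (number of parts - 1) values read from the
    blob; it is empty when there is only one part.
    """
    last = total_points - sum(explicit_counts)
    if last < 0:
        raise ValueError("part sizes exceed the total number of points")
    return list(explicit_counts) + [last]
```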
    221 
    222 followed by, for each part, points coordinates:
    223 
    224   * First point of first part :
    225 ||'''Format'''||'''Content'''||
    226 || varint || x[0] = (varint + xorigin * xyscale) / xyscale ||
    227 || varint || y[0] = (varint + yorigin * xyscale) / xyscale ||
    228 || varint ( present only if Z component ) || z[0] = (varint + zorigin * zscale) / zscale ||
    229 || varint ( present only if M component ) || m[0] = (varint + morigin * mscale) / mscale ||
    230 
    231 Note: varint here (and not varuint like in point or multipoint). The rationale is unclear, but there are samples that seem to indicate that this is the case.
    232 
    233   * For each next point (other points of the first part, or for all points of the following parts) :
    234 ||'''Format'''||'''Content'''||
    235 || varint || dx = dx + varint. x[i] = x[0] + dx / xyscale ||
        || varint || dy = dy + varint. y[i] = y[0] + dy / xyscale ||
    236 || varint ( present only if Z component ) || dz = dz + varint. z[i] = z[0] + dz / zscale ||
    237 || varint ( present only if M component ) || dm = dm + varint. m[i] = m[0] + dm / mscale ||
    238 
    239 ==== String (type=4) or XML (type=12) ====
    240 
    241 Number of bytes of the string as a varuint, followed by string content
    242 
    243 ==== Other types ====
    244 
    245 An int16 value for an int16 field, an int32 value for an int32 field, etc.
    246 
    247 Note: datetime values are the number of days since 30 December 1899 00:00:00, encoded as float64
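
As an illustration of the datetime note, a decoded float64 can be converted with the standard library. The names below are hypothetical; the fractional part of the value is interpreted as a fraction of a day.

```python
from datetime import datetime, timedelta

FGDB_EPOCH = datetime(1899, 12, 30)  # day 0 of the datetime encoding

def fgdb_days_to_datetime(days):
    """Convert a number of days since 30 December 1899 to a datetime."""
    return FGDB_EPOCH + timedelta(days=days)
```

For instance, a stored value of 1.5 corresponds to 31 December 1899 at 12:00.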
    248 
    249 
    250 = Specification of .gdbtablx files =
    251 
    252 .gdbtablx files contain the offset of the rows of the associated .gdbtable file.
    253 
    254 == Header (16 bytes) ==
    255 
    256 ||'''Format'''||'''Content'''||
    257 || 4 bytes || 0x03 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
    258 || 4 bytes || 0x01 0x00 0x00 0x00 (for GDB 10?), 0x03 0x00 0x00 0x00 (for GDB 9?) - unknown role.  ||
    259 || int32 || number of rows, including deleted rows ||
    260 || 4 bytes || 0x05 0x00 0x00 0x00 - unknown role. Constant among the files. Kind of signature ? ||
    261 
    262 == Offset section ==
    263 
    264 The section starts immediately after the header (at offset 16) and is made of 5 x number_rows bytes. For each row,
    265 
    266 ||'''Format'''||'''Content'''||
    267 || int32 || offset of the beginning of the row in the .gdbtable file, or 0 if the row is deleted ||
    268 || ubyte || always 0 - unknown role ||
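
Put together, the header and the offset section allow the row offsets to be listed. The following is a sketch under the layout described above; the function name is hypothetical and the whole .gdbtablx content is assumed to be in memory.

```python
import struct

def read_row_offsets(tablx):
    """Return the .gdbtable offset of every row (0 for deleted rows)."""
    # The row count is the third int32 of the 16-byte header.
    (n_rows,) = struct.unpack_from("<i", tablx, 8)
    offsets = []
    for i in range(n_rows):
        # Each entry is 5 bytes: an int32 offset and one byte that is always 0.
        offset, _unknown = struct.unpack_from("<iB", tablx, 16 + 5 * i)
        offsets.append(offset)
    return offsets
```

Each non-zero offset points at the int32 row length that starts a row in the .gdbtable file.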
    269 
    270 == Padding section ==
    271 
    272 A long run of bytes set to 0.
    273 
    274 == Trailing section ==
    275 
    276 The last few bytes look like 00 00 00 00 X 00 00 00 X 00 00 00 00 00 00 00 where X is non 0 (often 1). Unknown role
    277 
    278 = License =
    279 
    280 This specification document is (C) 2013 Even Rouault and licensed under the [http://creativecommons.org/licenses/by-sa/3.0 CC-BY-SA 3.0] terms [[Image(http://i.creativecommons.org/l/by-sa/3.0/88x31.png, link=http://creativecommons.org/licenses/by-sa/3.0)]].
    281 
    282 Note: the scope of the copyrighted material does, of course, not extend onto any source or binary code derived from the specification, that may be licensed under the terms that their author may see fit.
     5 The specification has moved to the [https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec github wiki].