Opened 15 years ago

Last modified 15 years ago

#839 closed defect (fixed)

[OGR] Attempt to use attribute index fails for large shp file

Reported by: jmckenna@…
Owned by: Daniel Morissette
Priority: high
Milestone:
Component: default
Version: unspecified
Severity: normal
Keywords:
Cc:

Description

The file is a 1.1 GB shapefile.  An attribute index file has been created on a
specific column (parcelid), creating .idm and .ind files. To test this attribute
index the following ogrinfo command is used:

$ ogrinfo SCAP_Cert2004_ParcelsUTM16M.shp SCAP_Cert2004_ParcelsUTM16M \
    -sql "SELECT * FROM SCAP_Cert2004_ParcelsUTM16M WHERE parcelid = '093519 D00015'"

which returns:

INFO: Open of `SCAP_Cert2004_ParcelsUTM16M.shp'
using driver `ESRI Shapefile' successful.
layer names ignored in combination with -sql.

Layer name: SCAP_Cert2004_ParcelsUTM16M
Geometry: Polygon
ERROR 1: Attempt to read shape with feature id (703141887) out of available range.

Here is the ogrinfo -summary output:

$ ogrinfo SCAP_Cert2004_ParcelsUTM16M.shp SCAP_Cert2004_ParcelsUTM16M -summary
INFO: Open of `SCAP_Cert2004_ParcelsUTM16M.shp'
using driver `ESRI Shapefile' successful.

Layer name: SCAP_Cert2004_ParcelsUTM16M
Geometry: Polygon
Feature Count: 332190
...


Here is the size of the files:


jeff     users        1.1G Apr 25 21:06 SCAP_Cert2004_ParcelsUTM16M.dbf
jeff     users         246 Apr 26 14:18 SCAP_Cert2004_ParcelsUTM16M.idm
jeff     users        742M Apr 26 15:38 SCAP_Cert2004_ParcelsUTM16M.ind
jeff     users        8.9M Apr 26 14:13 SCAP_Cert2004_ParcelsUTM16M.qix
jeff     users        294M Apr 25 20:32 SCAP_Cert2004_ParcelsUTM16M.shp
jeff     users        2.6M Apr 25 20:33 SCAP_Cert2004_ParcelsUTM16M.shx

Change History (6)

comment:1 Changed 15 years ago by Daniel Morissette

Reassigned to myself. Trying to reproduce and check in the debugger.

I initially thought this could be due to a limitation in the number of records
in a .IND file, but looking at the code, the Feature indices returned by the
functions are GInt32 so this shouldn't be a problem unless I made a mistake when
implementing the code. Maybe a limitation of the shapefile driver? I'll see once
I get in the debugger.

Note that we get a seg.fault on Linux after the "Attempt to read shape with
feature id (703141887) out of available range."

comment:2 Changed 15 years ago by Daniel Morissette

Quick update:

After discussing this with Frank, it seems that shapefile attribute indexes were
implemented only for integer fields, but not for string fields.

Unfortunately we do not have any integer field in this file to verify that this
is really the issue. Another possibility is that this could be a limitation of
the implementation of the .IND file format or of the format itself. I'll make a
few more tests to verify that.

comment:3 Changed 15 years ago by Daniel Morissette

OK, I have verified that I am able to create an index on a string field of a
shapefile in OGR and query it, and that the index is indeed being used. So the
issue is not that string field indexes are not implemented. It must be a
limitation of the index format or of its implementation. Will dig further
later.

comment:4 Changed 15 years ago by Daniel Morissette

More good news: I have converted the dataset to MapInfo .TAB format (which uses
the same index stuff as the OGR shapefile index) and indexed on the "parcelid"
field and run a few tests using the MITAB utilities and the index works.

We're not there yet, but at least that confirms that the index file format and
its implementation can work on such a large dataset.

comment:5 Changed 15 years ago by Daniel Morissette

We've found that a good way to work around the issue was to reduce the size of
the parcelid field from 254 chars to 25 chars. This way the index works, and the
file has a more reasonable size. 

I finally identified the source of the problem: it's an overflow of a 1-byte
field in the header of the .IND file. The original file had string fields 254
chars long, which resulted in a 128-byte key being used in the index. The index
uses 512-byte nodes, so only 3 entries can fit in a node. That explains why the
index ended up being so big. The field that overflows is the SubTreeDepth, i.e.
the depth of the tree. In this case the resulting index is 551 levels deep, but
since the value is written in a single byte, when we reopen the file we read 39
(551 modulo 256), and hence all the problems that we've seen.
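
The wraparound described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not MITAB code: the 4-byte record pointer per entry and the 12-byte node header are assumptions made for illustration. Only the 512-byte node size, the 128-byte key, and the depth of 551 come from the analysis above.

```python
import struct

NODE_SIZE = 512        # bytes per index node (from the analysis above)
KEY_LENGTH = 128       # bytes per key for the 254-char field (from the analysis above)
POINTER_SIZE = 4       # assumption: each entry stores a 4-byte record pointer
NODE_HEADER_SIZE = 12  # assumption: per-node bookkeeping bytes


def entries_per_node(node_size=NODE_SIZE, key_length=KEY_LENGTH):
    """Number of (key, pointer) entries that fit in one node."""
    return (node_size - NODE_HEADER_SIZE) // (key_length + POINTER_SIZE)


def depth_after_round_trip(depth):
    """Store the depth in a single unsigned byte and read it back."""
    stored = struct.pack("B", depth & 0xFF)  # only the low 8 bits survive
    return struct.unpack("B", stored)[0]


print(entries_per_node())           # 3
print(depth_after_round_trip(551))  # 39
```

Under these assumptions the numbers match the ticket: three entries per node, and a depth of 551 that reads back as 39.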

I am a bit surprised to see the index going 551 levels deep; I would have
expected that for this number of records and this key size it should go no
deeper than 20-30 levels... I must have screwed up something in my calculations.
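
As a rough sanity check on that expectation, the depth of a balanced tree can be estimated from the record count and the fan-out. The sketch below assumes a perfectly balanced tree in which every node holds either 3 entries (full) or 2 entries (partly filled). It does not model how the index file is actually built.

```python
def min_levels(record_count, fanout):
    """Smallest number of levels such that fanout ** levels >= record_count."""
    levels = 0
    capacity = 1
    while capacity < record_count:
        capacity *= fanout
        levels += 1
    return levels


RECORD_COUNT = 332190  # feature count reported by ogrinfo -summary

print(min_levels(RECORD_COUNT, 3))  # 12
print(min_levels(RECORD_COUNT, 2))  # 19
```

Both estimates are consistent with the "no deeper than 20-30 levels" expectation and sit far below the 255 limit of a 1-byte field.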

I won't spend any more time to fix this. What I'll do is add a test when writing
the header of the .IND file, and if there is an overflow you'll get the
following message:

ERROR 7: Index no 1 is too large and will not be useable. (SubTreeDepth = 551,
cannot exceed 255).
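
A check of this kind might look like the following sketch. The function name, the exception type, and the choice to return the encoded byte are hypothetical and chosen for illustration. The real check lives in the MITAB/OGR sources, not in Python. Only the message text and the 255 limit are taken from the comment above.

```python
MAX_SUBTREE_DEPTH = 255  # largest value that fits in the 1-byte header field


class IndexTooLargeError(ValueError):
    """Hypothetical error raised when the index header cannot be written."""


def encode_subtree_depth(index_no, subtree_depth):
    """Return the header byte for the depth, refusing values that would wrap."""
    if subtree_depth > MAX_SUBTREE_DEPTH:
        raise IndexTooLargeError(
            f"Index no {index_no} is too large and will not be useable. "
            f"(SubTreeDepth = {subtree_depth}, cannot exceed {MAX_SUBTREE_DEPTH})."
        )
    return bytes([subtree_depth])
```

Calling `encode_subtree_depth(1, 551)` raises with the message quoted above, while a depth of 255 or less is encoded unchanged.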

comment:6 Changed 15 years ago by Daniel Morissette

Marking Fixed. I have committed the test with the error message to the master
MITAB CVS and backported to the OGR CVS.