Opened 14 years ago
Closed 14 years ago
#3247 closed defect (fixed)
Shapefile driver truncated field names, sometimes causing duplicates
Reported by: | Daniel Morissette | Owned by: | chaitanya |
---|---|---|---|
Priority: | normal | Milestone: | 1.7.0 |
Component: | OGR_SF | Version: | unspecified |
Severity: | blocker | Keywords: | shape |
Cc: | warmerdam, Even Rouault |
Description
Since the DBF format limits field/attribute names to 10 chars, the shapefile driver has to truncate the name of fields longer than 10 chars. However, this is done without checking for potential duplicates field names.
For instance, if a dataset contains fields "mylongattr1" and "mylongattr2", they are both written as "mylongattr". We should change the dups to have a unique name, presumably by adding a unique number at the end.
I will attach a mif/mid dataset that can be used to reproduce the problem. Simply convert this file to shapfile using ogr2ogr and you see that the "POPPLACE_test1" and "POPPLACE_test2" fields end up being written as duplicate "POPPLACE_t" fields.
For an example of the right thing to do, see the pgsql2shp command: http://trac.osgeo.org/postgis/attachment/ticket/34/pgsql2shp.c#L2654
Attachments (1)
Change History (15)
by , 14 years ago
Attachment: | ticket3247-test.zip added |
---|
comment:1 by , 14 years ago
Cc: | added |
---|---|
Keywords: | shape added |
Milestone: | → 1.7.0 |
Owner: | changed from | to
Chaitanya,
Please update the CreateField() method in the shapefile driver to watch for duplicate names. If the bApproxOK flag passed into CreateField() is false, duplicates should result in an error. If it is TRUE then apply appropriate name adjustment rules to avoid duplicates and issue a CE_Warning message via CPLError(), but return success from the method.
Also, please review existing tickets to see if there are any others in this vein about the shapefile driver.
follow-up: 3 comment:2 by , 14 years ago
Chaitanya,
after a bit of discussion with Frank on IRC, it appears that updating CreateField() is just one step towards the complete solution. Another step is needed to make it useful for ogr2ogr. Currently, ogr2ogr uses the OGRFeature::SetFrom() method to copy the fields of the source feature to the fields of the target feature. SetFrom() does strict field name matching, so it cannot set the content of a field whose name has been renamed.
Here's a simple example with the PG driver that already implements field name "laundering".
Content of test.csv: acol#,dummycol avalue,dummyvalue $ ogr2ogr -overwrite -f postgresql PG:dbname=autotest test.csv $ ogrinfo -ro -al PG:dbname=autotest test INFO: Open of `PG:dbname=autotest' using driver `PostgreSQL' successful. Layer name: test Geometry: Unknown (any) Feature Count: 1 Layer SRS WKT: (unknown) FID Column = ogc_fid Geometry Column = wkb_geometry acol_: String (0.0) dummycol: String (0.0) OGRFeature(test):1 acol_ (String) = (null) dummycol (String) = dummyvalue
As you can see "acol#" has been renamed to "acol_", but ogr2ogr could not set its value.
A solution would be that ogr2ogr, just after calling CreateField(), queries the last added field definition to get the real field name set by the driver . It can then create a map from source field names to target field names. A new method OGRFeature::SetFromWithMap() should be created to take care of finding the appropriate target field for each source field. To ease maintenance, I think SetFrom() should just call SetFromWithMap() with a NULL map.
(Note: this will work fine for creation/overwriting of new layers, which will be a big improvement. However, we must be aware that for update of existing layers, we have currently no way of knowing how the driver has laundered the names. That corner case can be worked around by using the AS operator in OGR SQL statements or the new OGR VRT field capability for those situations)
comment:3 by , 14 years ago
Cc: | added |
---|---|
Status: | new → assigned |
Replying to rouault:
A solution would be that ogr2ogr, just after calling CreateField(), queries the last added field definition to get the real field name set by the driver . It can then create a map from source field names to target field names. A new method OGRFeature::SetFromWithMap() should be created to take care of finding the appropriate target field for each source field. To ease maintenance, I think SetFrom() should just call SetFromWithMap() with a NULL map.
How about a map from source field index to target field index? It would be a performance boost when adding several features.
comment:5 by , 14 years ago
Chaitanya,
yes, a map from source field index to target field index could perform better by avoiding field name lookups.
comment:6 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Added the index-to-index map in trunk in r18226.
comment:7 by , 14 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Chaintanya,
Here are some more changes related to that ticket :
- r18231 /trunk/gdal/ (apps/ogr2ogr.cpp ogr/ogrfeature.cpp): Fix array overrun and memory leaks introduced by r18226 (#3247)
- r18232 /trunk/gdal/ogr/ogrsf_frmts/shape/ogrshapelayer.cpp: Shape driver : avoid buffer overrrun when there are duplicate small field names; don't alter the field defn object passed to CreateField() (#3247)
- r18233 /trunk/gdal/apps/ogr2ogr.cpp: ogr2ogr : take into account the fact that field creation can fails when building the map; fix the -append case (#3247)
- r18234 /trunk/autotest/utilities/test_ogr2ogr.py: Test the values of the fields in the output of the ogr2ogr command
Please add tests to check at least that :
- long field names are properly truncated by OGRShapeLayer::CreateField() and that no duplicate names can result from the truncation (in autotest/ogr/ogr_shape.py)
- ogr2ogr correctly remaps fields when it launders field names (in autotest/utilities/test_ogr2ogr.py)
comment:8 by , 14 years ago
comment:9 by , 14 years ago
Added a description of this functionality to the docs in the trunk. r18372.
comment:10 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Added test for duplicate field names in shapefile driver. Added test for field remap in ogr2ogr. r18390 in trunk.
Added docs to represent the changes. r18391 in trunk.
At present ogr2ogr is ignoring duplicate fields. If the -select field_list option is given, the first of the duplicates is used; otherwise, the last of the duplicates is used.
comment:11 by , 14 years ago
Resolution: | fixed |
---|---|
Severity: | normal → blocker |
Status: | closed → reopened |
Chaintanya,
r18390 breaks on a -DDEBUG build. See http://buildbot.osgeo.org:8500/builders/telascience-quick/builds/442/steps/test/logs/stdio
There's a CPLAssert in the GML driver that checks that there are no duplicate field names. I guess this is an invalid situation for a GML file to have duplicate field names (and many others formats, in particular RDBMS wouldn't allow that), but this check might be better handled otherwise than with a CPLAssert that causes crashes (and this would likely be worth another ticket).
To fix quickly the current situation for the test_ogr2ogr_20() test, I'd suggest you to use a CSV file with one single record instead of a GML file and avoid putting duplicate field names in the CSV file (remove also the empty field names. This is clearly invalid and whatever the current behaviour is currently, it is undefined and we should not test any expected result for that). I'd consider this situation as invalid and the behaviour of previous GDAL versions pretty much undefined for those situations. The test_ogr2ogr_20() function should not only test that the field names are properly laundered as you did, but also that the content of a record of the CSV file has been properly transfered in the laundered field names to check that the SetFromWithRemap() works well. Just 'a,b,c,d,e,...' for the content will do ;-)
Please fix it ASAP so we have a test suite that passes for 1.7.0 release.
comment:12 by , 14 years ago
comment:13 by , 14 years ago
Chaitanya, note that I've commited in r18393 a test case that would not work without r18236. We need to query for potentially already existing field names and not blindly assumes that CreateField() will result in an extra field to be added. That's the case for drivers like GPX, GPSTrackMaker, DXF and maybe others.
comment:14 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
r18406 in trunk will test that the data goes into the correct field.
test1.mif/mid dataset to reproduce the problem