Opened 14 years ago

Closed 14 years ago

#3247 closed defect (fixed)

Shapefile driver truncated field names, sometimes causing duplicates

Reported by: Daniel Morissette Owned by: chaitanya
Priority: normal Milestone: 1.7.0
Component: OGR_SF Version: unspecified
Severity: blocker Keywords: shape
Cc: warmerdam, Even Rouault

Description

Since the DBF format limits field/attribute names to 10 chars, the shapefile driver has to truncate the name of fields longer than 10 chars. However, this is done without checking for potential duplicates field names.

For instance, if a dataset contains fields "mylongattr1" and "mylongattr2", they are both written as "mylongattr". We should change the dups to have a unique name, presumably by adding a unique number at the end.

I will attach a mif/mid dataset that can be used to reproduce the problem. Simply convert this file to shapfile using ogr2ogr and you see that the "POPPLACE_test1" and "POPPLACE_test2" fields end up being written as duplicate "POPPLACE_t" fields.

For an example of the right thing to do, see the pgsql2shp command: http://trac.osgeo.org/postgis/attachment/ticket/34/pgsql2shp.c#L2654

Attachments (1)

ticket3247-test.zip (905 bytes ) - added by Daniel Morissette 14 years ago.
test1.mif/mid dataset to reproduce the problem

Download all attachments as: .zip

Change History (15)

by Daniel Morissette, 14 years ago

Attachment: ticket3247-test.zip added

test1.mif/mid dataset to reproduce the problem

comment:1 by warmerdam, 14 years ago

Cc: warmerdam added
Keywords: shape added
Milestone: 1.7.0
Owner: changed from warmerdam to chaitanya

Chaitanya,

Please update the CreateField() method in the shapefile driver to watch for duplicate names. If the bApproxOK flag passed into CreateField() is false, duplicates should result in an error. If it is TRUE then apply appropriate name adjustment rules to avoid duplicates and issue a CE_Warning message via CPLError(), but return success from the method.

Also, please review existing tickets to see if there are any others in this vein about the shapefile driver.

comment:2 by Even Rouault, 14 years ago

Chaitanya,

after a bit of discussion with Frank on IRC, it appears that updating CreateField() is just one step towards the complete solution. Another step is needed to make it useful for ogr2ogr. Currently, ogr2ogr uses the OGRFeature::SetFrom() method to copy the fields of the source feature to the fields of the target feature. SetFrom() does strict field name matching, so it cannot set the content of a field whose name has been renamed.

Here's a simple example with the PG driver that already implements field name "laundering".

Content of test.csv:
acol#,dummycol
avalue,dummyvalue
 
$ ogr2ogr -overwrite -f postgresql PG:dbname=autotest test.csv
 
$ ogrinfo -ro -al PG:dbname=autotest test

INFO: Open of `PG:dbname=autotest'
      using driver `PostgreSQL' successful.

Layer name: test
Geometry: Unknown (any)
Feature Count: 1
Layer SRS WKT:
(unknown)
FID Column = ogc_fid
Geometry Column = wkb_geometry
acol_: String (0.0)
dummycol: String (0.0)
OGRFeature(test):1
  acol_ (String) = (null)
  dummycol (String) = dummyvalue

As you can see "acol#" has been renamed to "acol_", but ogr2ogr could not set its value.

A solution would be that ogr2ogr, just after calling CreateField(), queries the last added field definition to get the real field name set by the driver . It can then create a map from source field names to target field names. A new method OGRFeature::SetFromWithMap() should be created to take care of finding the appropriate target field for each source field. To ease maintenance, I think SetFrom() should just call SetFromWithMap() with a NULL map.

(Note: this will work fine for creation/overwriting of new layers, which will be a big improvement. However, we must be aware that for update of existing layers, we have currently no way of knowing how the driver has laundered the names. That corner case can be worked around by using the AS operator in OGR SQL statements or the new OGR VRT field capability for those situations)

in reply to:  2 comment:3 by chaitanya, 14 years ago

Cc: Even Rouault added
Status: newassigned

Replying to rouault:

A solution would be that ogr2ogr, just after calling CreateField(), queries the last added field definition to get the real field name set by the driver . It can then create a map from source field names to target field names. A new method OGRFeature::SetFromWithMap() should be created to take care of finding the appropriate target field for each source field. To ease maintenance, I think SetFrom() should just call SetFromWithMap() with a NULL map.

How about a map from source field index to target field index? It would be a performance boost when adding several features.

comment:4 by chaitanya, 14 years ago

Fixed the CreateField in shapefile driver in trunk. (r18211)

comment:5 by Even Rouault, 14 years ago

Chaitanya,

yes, a map from source field index to target field index could perform better by avoiding field name lookups.

comment:6 by chaitanya, 14 years ago

Resolution: fixed
Status: assignedclosed

Added the index-to-index map in trunk in r18226.

comment:7 by Even Rouault, 14 years ago

Resolution: fixed
Status: closedreopened

Chaintanya,

Here are some more changes related to that ticket :

  • r18231 /trunk/gdal/ (apps/ogr2ogr.cpp ogr/ogrfeature.cpp): Fix array overrun and memory leaks introduced by r18226 (#3247)
  • r18232 /trunk/gdal/ogr/ogrsf_frmts/shape/ogrshapelayer.cpp: Shape driver : avoid buffer overrrun when there are duplicate small field names; don't alter the field defn object passed to CreateField() (#3247)
  • r18233 /trunk/gdal/apps/ogr2ogr.cpp: ogr2ogr : take into account the fact that field creation can fails when building the map; fix the -append case (#3247)
  • r18234 /trunk/autotest/utilities/test_ogr2ogr.py: Test the values of the fields in the output of the ogr2ogr command

Please add tests to check at least that :

  • long field names are properly truncated by OGRShapeLayer::CreateField() and that no duplicate names can result from the truncation (in autotest/ogr/ogr_shape.py)
  • ogr2ogr correctly remaps fields when it launders field names (in autotest/utilities/test_ogr2ogr.py)

comment:8 by Even Rouault, 14 years ago

One more fix that was necessary for some drivers like GPX... or DXF :

r18236 /trunk/gdal/apps/ogr2ogr.cpp: ogr2ogr: take into account the fact that the fields can have been created at layer creation (#3247)

comment:9 by chaitanya, 14 years ago

Added a description of this functionality to the docs in the trunk. r18372.

comment:10 by chaitanya, 14 years ago

Resolution: fixed
Status: reopenedclosed

Added test for duplicate field names in shapefile driver. Added test for field remap in ogr2ogr. r18390 in trunk.

Added docs to represent the changes. r18391 in trunk.

At present ogr2ogr is ignoring duplicate fields. If the -select field_list option is given, the first of the duplicates is used; otherwise, the last of the duplicates is used.

comment:11 by Even Rouault, 14 years ago

Resolution: fixed
Severity: normalblocker
Status: closedreopened

Chaintanya,

r18390 breaks on a -DDEBUG build. See http://buildbot.osgeo.org:8500/builders/telascience-quick/builds/442/steps/test/logs/stdio

There's a CPLAssert in the GML driver that checks that there are no duplicate field names. I guess this is an invalid situation for a GML file to have duplicate field names (and many others formats, in particular RDBMS wouldn't allow that), but this check might be better handled otherwise than with a CPLAssert that causes crashes (and this would likely be worth another ticket).

To fix quickly the current situation for the test_ogr2ogr_20() test, I'd suggest you to use a CSV file with one single record instead of a GML file and avoid putting duplicate field names in the CSV file (remove also the empty field names. This is clearly invalid and whatever the current behaviour is currently, it is undefined and we should not test any expected result for that). I'd consider this situation as invalid and the behaviour of previous GDAL versions pretty much undefined for those situations. The test_ogr2ogr_20() function should not only test that the field names are properly laundered as you did, but also that the content of a record of the CSV file has been properly transfered in the laundered field names to check that the SetFromWithRemap() works well. Just 'a,b,c,d,e,...' for the content will do ;-)

Please fix it ASAP so we have a test suite that passes for 1.7.0 release.

comment:12 by chaitanya, 14 years ago

Modified the test in r18392 in trunk.

ogr2ogr is skipping a field when the laundered form of a field name is already present in the input layer. The original field is lost. This is an effect of r18236.

comment:13 by Even Rouault, 14 years ago

Chaitanya, note that I've commited in r18393 a test case that would not work without r18236. We need to query for potentially already existing field names and not blindly assumes that CreateField() will result in an extra field to be added. That's the case for drivers like GPX, GPSTrackMaker, DXF and maybe others.

comment:14 by chaitanya, 14 years ago

Resolution: fixed
Status: reopenedclosed

r18406 in trunk will test that the data goes into the correct field.

Note: See TracTickets for help on using tickets.