Opened 15 years ago

Closed 5 years ago

#2702 closed defect (wontfix)

OGR PGeo driver does not read doubles and floats from the database with full precision

Reported by: jjr8 Owned by: warmerdam
Priority: normal Milestone: closed_because_of_github_migration
Component: OGR_SF Version: 1.5.0
Severity: normal Keywords: pgeo
Cc:

Description

The following demo used OGR 1.5 with Python as described in #2688. Because Python natively only supports the double data type, I am only demoing the problem for that type. It can be demoed for the float data type, but the explanation is overly complicated due to 64-bit to 32-bit casting. Rest assured the problem does apply to floats as well as doubles.

To trust the demo, you must first have confidence that Python displays IEEE-754 doubles with full precision. This can be demonstrated by having Python parse and display the numbers 1.2 and 1.1:

>>> 1.2
1.2
>>> 1.1
1.1000000000000001
>>> 1.1000000000000001
1.1000000000000001

IEEE-754 can represent 1.2 exactly, so Python prints 1.2. But it cannot represent 1.1 exactly, so the parser creates the closest possible value and then prints it. When you ask it to parse and display that value, it comes back unchanged.

The demo will write three numbers to a database and then try to read them back. Here are the numbers:

>>> 1.79769313486231
1.79769313486231
>>> 1.797693134862312
1.797693134862312
>>> 1.7976931348623153
1.7976931348623153

As you can see, IEEE-754 can represent them exactly, because they come back unchanged from the Python interpreter. They have 15, 16, and 17 decimal digits of precision.

First we write them to TestPoints table of the attached personal geodatabase. This table has just one row. The row has three fields with data type double, one field for each of our three values. We use ADODB to execute a SQL UPDATE command to update the row.

>>> from win32com.client import Dispatch
>>> conn = Dispatch('ADODB.Connection')
>>> conn.Open('Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\\TestPGeo2\\MyGeoDB2.mdb')
>>> rs, num = conn.Execute('UPDATE TestPoints SET FDouble1 = 1.79769313486231, FDouble2 = 1.797693134862312, FDouble3 = 1.7976931348623153')
>>> num
1
>>> conn.Close()
>>> del rs, conn
>>> 

To validate that the values were written correctly and can be read back correctly, we use ADO to read them:

>>> conn = Dispatch('ADODB.Connection')
>>> conn.Open('Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\\TestPGeo2\\MyGeoDB2.mdb')
>>> rs, num = conn.Execute('SELECT * FROM TestPoints')
>>> rs.Fields.Item('FDouble1').Value
1.79769313486231
>>> rs.Fields.Item('FDouble2').Value
1.797693134862312
>>> rs.Fields.Item('FDouble3').Value
1.7976931348623153
>>> rs.Close(); conn.Close(); del rs, conn
>>> 

You can see that the values came back exactly as they were written. To verify that ArcGIS provides the same fidelity, we read them using the ArcGIS cursor APIs:

>>> gp = Dispatch('esriGeoprocessing.GPDispatch')
>>> cur = gp.SearchCursor('C:\\TestPGeo2\\MyGeoDB2.mdb\\TestPoints')
>>> row = cur.Next()
>>> row.GetValue('FDouble1')
1.79769313486231
>>> row.GetValue('FDouble2')
1.797693134862312
>>> row.GetValue('FDouble3')
1.7976931348623153
>>> del row, cur, gp
>>> 

We are now pretty confident that the values were stored properly and can be retrieved accurately. (It is quite possible ArcGIS uses ADO internally, so the second test might not boost confidence that much, but that doesn't really matter.)

Now we try OGR:

>>> from osgeo import ogr
>>> ds = ogr.Open('C:\\TestPGeo2\\MyGeoDB2.mdb')
>>> layer = ds.GetLayerByName('TestPoints')
>>> feat = layer.GetNextFeature()
>>> feat.GetFieldAsDouble(feat.GetFieldIndex('FDouble1'))
1.79769313486231
>>> feat.GetFieldAsDouble(feat.GetFieldIndex('FDouble2'))
1.79769313486231
>>> feat.GetFieldAsDouble(feat.GetFieldIndex('FDouble3'))
1.79769313486232
>>> del feat, layer, ds

OGR correctly returns the first value, but not the second or third. The other two appear to have been rounded to 15 decimal digits of precision. If you open the attached geodatabase with MS Access 2007, open the table, and widen the columns, you will see that Access displays the same values as OGR!

Now, one could argue that OGR's behavior is correct because it mimics the Access GUI. But I argue that is wrong. Personal geodatabases happen to use MS Access format (a.k.a. JET 4.0) but they are meant to be accessed through ArcGIS, not MS Access. The ArcGIS APIs are capable of reading doubles at full precision. OGR should provide the same capability.

I looked at GDAL's cpl_odbc.cpp, ogrpgeolayer.cpp, and so on. They are correct, as far as I can tell. I suspect the problem lies in the ODBC driver itself, and that Access is using the same driver. I searched the internet but could not find anything, but I only looked for five minutes.

I anticipate you guys will not want to fix this one, especially if it means redoing PGeo with ADO rather than ODBC, or something that drastic. While it would be optimal for PGeo to exhibit perfect fidelity with ArcGIS's own APIs, I probably agree that it is not worth a PGeo rewrite. 15 digits of precision is probably sufficient. But I'm still filing this bug so the problem is documented.

I only ask that if you have an ODBC expert on staff, could you please trace through CPLODBCStatement::Fetch() in cpl_odbc.cpp and validate that, in fact, you get back rounded values from SQLGetData(), and there is nothing you can do to make it give back fully-precise values?

Thanks!

BTW, the problem also occurs for the 32-bit float data type: it is rounded to 7 decimal digits (if I recall correctly).

Attachments (1)

TestPGeo2.zip (23.7 KB ) - added by jjr8 15 years ago.
Personal geodatabase used in the demonstration.

Download all attachments as: .zip

Change History (8)

by jjr8, 15 years ago

Attachment: TestPGeo2.zip added

Personal geodatabase used in the demonstration.

comment:1 by jjr8, 15 years ago

As an experiment, I opened the database with ADODB using the MS Access ODBC driver instead of the JET OLEDB driver. I was hoping to reproduce the problem, but, interestingly, it worked properly:

>>> conn = Dispatch('ADODB.Connection')
>>> conn.Open('Driver={Microsoft Access Driver (*.mdb)};Dbq=C:\\TestPGeo2\\MyGeoDB2.mdb')
>>> rs, num = conn.Execute('SELECT * FROM TestPoints')
>>> rs.Fields.Item('FDouble1').Value
1.79769313486231
>>> rs.Fields.Item('FDouble2').Value
1.797693134862312
>>> rs.Fields.Item('FDouble3').Value
1.7976931348623153

I'm not exactly sure how ADO DB layers itself on top of ODBC in this situation, although this can probably be determined by reading Microsoft docs.

comment:2 by Even Rouault, 15 years ago

Hum, interesting and deep issue...

I've reproduced the issue... on Linux with the standard - and unmaintened - version of mdbtools (the ODBC driver for MDB database that works with unixODBC). The reasion is that OGR query all datatypes as strings and the mdbtools driver does something like in src/libmdb/data.c :

text = g_strdup_printf("%.*f", DBL_DIG - floor_log10(td,0) - 1, td);

The effect of this is to truncate to 15 figures after point.

I changed the above line to g_strdup_printf("%.18f", td) and when running your OGR python script, I get back to the expected values :-)

So, my explanation would be that the Microsoft ODBC driver does something similar to the mdbtools driver when being requested to return the column values as strings instead of doubles. A full fix of medium complexity, avoiding round-trips of double/string/double, would be to change cpl_odbc.cpp and the PGeo driver to do their SQLGetData() queries witch a fetch type being double, when the data type is double, instead of querying the value as a string, and storing that value as a double. This would require changing CPLODBCStatement::Fetch(), adding a CPLODBCStatement::GetColDataAsDouble() and using it.

As a less ambitious fix, could you test replacing the first call to SQLGetData() in CPLODBCStatement::Fetch() by this :

        if (m_panColType[iCol] == SQL_DOUBLE || m_panColType[iCol] == SQL_FLOAT)
        {
            double dfValue = 0;
            nRetCode = SQLGetData( m_hStmt, iCol + 1, SQL_C_DOUBLE,
                        &dfValue, sizeof(dfValue), &cbDataLen );
            sprintf(szWrkData, "%.18f", dfValue);
            fprintf(stderr, "col %d, val = %s\n", iCol + 1, szWrkData);
            if( cbDataLen != SQL_NULL_DATA )
            {
                cbDataLen = strlen(szWrkData);
            }
        }
        else
        {
            nRetCode = SQLGetData( m_hStmt, iCol + 1, nFetchType,
                                szWrkData, sizeof(szWrkData)-1, 
                                &cbDataLen );
        }

This is totally untested code as the mdbtools driver doesn't support retrieving anything than text...

My explanation for your last comment where you are surprised that it works, is that when using directly the ODBC API, there's no round&trip with strings.

Side note, in case you would make further tests with OGR (no directly hit by your above python examples), you could run into another bug/feature, because OGRFeature::GetFieldAsString() does a sprintf(..., %.15g", ...)...

in reply to:  2 ; comment:3 by jjr8, 15 years ago

Replying to rouault:

So, my explanation would be that the Microsoft ODBC driver does something similar to the mdbtools driver when being requested to return the column values as strings instead of doubles.

Thanks for the detailed response. I believe you are correct. Somehow when I was reviewing cpl_odbc.cpp, I did not study this carefully enough:

        SQLSMALLINT nFetchType = GetTypeMapping( m_panColType[iCol] );

        // Handle values other than WCHAR and BINARY as CHAR.
        if( nFetchType != SQL_C_BINARY && nFetchType != SQL_C_WCHAR )
            nFetchType = SQL_C_CHAR;

I missed the fact that nFetchType was being overwritten with SQL_C_CHAR for all cases other than SQL_C_BINARY and SQL_C_WCHAR.

I would bet that you are correct about the Microsoft ODBC driver: that it is basically doing sprintf %.15f when you ask for a double as SQL_C_CHAR.

At first glance, I like your "less ambitious" fix. I think its a simple way to achieve the desired result without rearchitecting the class interface. If you go with it, I suggest you handle 32-bit float as a separate case, with an sprintf of something less than 18. I'm also not sure you should use %f. It might be better to use %e instead.

I went off and tried to find out what you should use for 32-bit float instead of 18. This gave me a big wake-up call on the issues of converting IEEE-754 from binary to decimal. Click here for a detailed example. I now think the best long term solution is to have something like a CPLODBCStatement::GetColDataAsDouble() function which just returns the binary obtained from the database, without a risky binary-->string-->binary conversion. If this is not possible, it is probably an improvement to go with your less ambitious fix, but I'm sorry I can't suggest a good replacement for 18 when doing the 32-bit case.

in reply to:  3 comment:4 by jjr8, 15 years ago

And I retract my statement that "Python displays IEEE-754 doubles with full precision". Based on my reading about IEEE-754 binary --> decimal conversion, I am not sure what Python is doing.

comment:5 by warmerdam, 15 years ago

Milestone: 1.6.0

I just wanted to note that we should be quite careful about changes in cpl_odbc.cpp at this late point in the release. In particular, I think it is important that we not change to requesting double precision fields in binary instead of ASCII as I *suspect* that it will cause compatability problems with some ODBC drivers.

I have removed the 1.6.0 milestone for that reason.

comment:6 by Jukka Rahkonen, 9 years ago

It seems that /trunk/gdal/port/cpl_odbc.cpp is still reading doubles as strings (GDAL 2.0). This is just a reminder, I can't say if it is good or bad. If it is OK I hope that someone closes this ticket as "wontfix".

comment:7 by Even Rouault, 5 years ago

Milestone: closed_because_of_github_migration
Resolution: wontfix
Status: newclosed

This ticket has been automatically closed because Trac is no longer used for GDAL bug tracking, since the project has migrated to GitHub. If you believe this ticket is still valid, you may file it to https://github.com/OSGeo/gdal/issues if it is not already reported there.

Note: See TracTickets for help on using tickets.