Opened 3 weeks ago

Closed 28 hours ago

#5734 closed defect (fixed)

Wrong ST_EstimatedExtent result for geography column

Reported by: strk Owned by: pramsey
Priority: medium Milestone: PostGIS 3.4.3
Component: postgis Version: 3.4.x
Keywords: Cc:

Description

See this session:

strk=# select id, st_astext(g) from qgis_issue_30294;
 id |   st_astext    
----+----------------
  1 | POINT(-10 -10)
  2 | POINT(20 30)
(2 rows)

strk=# analyze qgis_issue_30294;
ANALYZE
strk=# select st_estimatedextent('qgis_issue_30294', 'g');
                              st_estimatedextent                               
-------------------------------------------------------------------------------
 BOX(0.813017427921295 -0.173346117138863,0.970626592636108 0.298534214496613)
(1 row)

The g column is of geography type. The estimated extent is plain wrong

Change History (18)

comment:1 by strk, 3 weeks ago

The index-mediated return is also NULL, even if an index is present:

strk=# \d qgis_issue_30294
                                   Table "public.qgis_issue_30294"
 Column |         Type          | Collation | Nullable |                   Default                    
--------+-----------------------+-----------+----------+----------------------------------------------
 id     | integer               |           | not null | nextval('qgis_issue_30294_id_seq'::regclass)
 g      | geography(Point,4326) |           |          | 
Indexes:
    "qgis_issue_30294_pkey" PRIMARY KEY, btree (id)
    "qgis_issue_30294_g_idx" gist (g)

strk=# select _postgis_index_extent('qgis_issue_30294', 'g');
 _postgis_index_extent 
-----------------------
 
(1 row)

comment:2 by strk, 3 weeks ago

This problem affects QGIS downstream, see https://github.com/qgis/QGIS/pull/57514#issuecomment-2124451905 - at the moment QGIS just "assumes" the extent of geography-based layers is always -180..180 -90..90 as it cannot trust PostGIS to tell

comment:3 by strk, 3 weeks ago

The estimate itself seems off:

DEBUG:  analyzing "public.qgis_issue_30294"
DEBUG:  "qgis_issue_30294": scanned 1 of 1 pages, containing 2 live rows and 0 dead rows; 2 rows in sample, 2 estimated total rows
DEBUG:  [gserialized_estimate.c:compute_gserialized_stats_mode:1419] compute_gserialized_stats called
DEBUG:  [gserialized_estimate.c:compute_gserialized_stats_mode:1420]  # sample_rows: 2
DEBUG:  [gserialized_estimate.c:compute_gserialized_stats_mode:1421]  estimate of total_rows: 2
DEBUG:  [gserialized_estimate.c:nd_box_from_gbox:602]  GBOX((0.96984631,-0.17101008,0),(0.96984637,-0.17101006,0))
DEBUG:  [gserialized_estimate.c:nd_box_from_gbox:602]  GBOX((0.81379765,0.29619813,0),(0.81379771,0.29619816,0))

I don't see how the above matches the actual boxes:

strk=# select g::geometry::box2d from qgis_issue_30294;
          g           
----------------------
 BOX(-10 -10,-10 -10)
 BOX(20 30,20 30)
(2 rows)

comment:4 by strk, 3 weeks ago

The gserialized_datum_get_gbox_p returns random data when the datum has no cached bbox:

DEBUG:  [gserialized_gist_2d.c:gserialized_datum_get_internals_p:491] gserialized has bbox ? 0
DEBUG:  [gserialized_gist_2d.c:gserialized_datum_get_internals_p:492] gserialized gpart size : 32
DEBUG:  [gserialized_gist_2d.c:gserialized_datum_get_internals_p:493] gserialized_max_header_size(): 52
DEBUG:  [gserialized_estimate.c:compute_gserialized_stats_mode:1464]  gserialized_datum_get_gbox_p returned box[0.813798 0.813798, 0.296198 0.296198]

comment:5 by strk, 3 weeks ago

according to logs the box is even calculated:

DEBUG:  [gserialized.c:gserialized_get_gbox_p:100]  gserialized_get_gbox_p: version is 1
DEBUG:  [gserialized2.c:lwgeom_from_gserialized2_buffer:1472] Got type 1 (Point), hasz=0 hasm=0 geodetic=1 hasbox=0
DEBUG:  [gserialized2.c:gserialized2_get_gbox_p:607] box is calculated
DEBUG:  [gserialized_estimate.c:compute_gserialized_stats_mode:1464]  gserialized_datum_get_gbox_p returned box[0.813798 0.813798, 0.296198 0.296198]

The culprit seems to be ptarray_calculate_gbox_geodetic

comment:6 by strk, 3 weeks ago

It looks like the lwgeom_calculate_gbox_geodetic function is called often by tests in liblwgeom/cunit/cu_geodetic.c but its results are never checked. Is a geography bounding box having a different meaning from a geometry one ?

comment:8 by pramsey, 3 weeks ago

If the problem is "how can QGIS get a bounds for the data set" that's a little fiddly. In general the answer to "how do I get a 'normal' box" is "just use geometry". But for estimated extent that's not really on the table, as the information we want is already in geocentrics. One could take all the corners of the geocentric box, convert them back to degrees and then take the geometry extent of the collection…

comment:9 by strk, 3 weeks ago

Estimating extent for QGIS is the source problem yes, I cannot tell what other problems may arise by having the same datatype mean different things. The next question is: do we have a function to perform that conversion of a GBOX from geodetic to cartesian ?

comment:10 by strk, 3 weeks ago

Answering myself: there's no function to convert from geodesic GBOX to non-geodetic one. At least, the cast from gegraphy to geometry *drops* the GBOX to trigger re-computation.

comment:11 by strk, 3 weeks ago

So next problem is how to tell, from just reading the stats, whether the ND_BOX is geocentric or not, because that seems dependent on whether the column being analyzed is of type geography or geometry, but there's no such indication in the ND_STATS structure to read, am I right ? Also, there doesn't seem to be a Z value for the geodetic ones, isn't that mandatory to convert between geocentric and not ?

strk=# select jsonb_pretty(_postgis_stats('qgis_issue_30294_geom', 'g')::jsonb -> 'extent');
{
    "max": [
        20.15,
        30.2
    ],
    "min": [
        -10.15,
        -10.2
    ]
}
strk=# select jsonb_pretty(_postgis_stats('qgis_issue_30294', 'g')::jsonb -> 'extent');
{
    "max": [
        0.970627,
        0.298534
    ],
    "min": [
        0.813017,
        -0.173346
    ]
}

comment:13 by strk, 2 days ago

Given the amount of changes in PR https://git.osgeo.org/gitea/postgis/postgis/pulls/205 I guess that work could not be backported to the 3.4 branch or earlier, what do you think Paul ? If so, would it be worth a smaller changeset to just return NULL from ST_EstimatedExtent when the column is a geography type ?

comment:14 by pramsey, 2 days ago

Well the geodetic box code is nicely self-contained… I could probably do a low-touch patch to 3.4 that would be applicable to 3.3 and earlier with a little work, now that I know what the fix actually is, and the exact places I need to touch.

comment:15 by pramsey, 2 days ago

Here's a working branch that should be the minimal changes needed for 3.4 https://github.com/pramsey/postgis/tree/stable-3.4-extent

comment:16 by Paul Ramsey <pramsey@…>, 28 hours ago

In 1d03645f/git:

Support geography estimated extents, references #5734

comment:17 by pramsey, 28 hours ago

In 15e18d33e/git, applied refactor and geography support to master branch.

comment:18 by pramsey, 28 hours ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.