Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#5906 closed enhancement (fixed)

CartoDB faster progressive reading

Reported by: pramsey
Owned by: Even Rouault
Priority: normal
Milestone: 2.0.0
Component: OGR_SF
Version: svn-trunk
Severity: normal
Keywords:
Cc:

Description

Lacking the concept of a "cursor", the driver has no easy way to fetch a fixed number of rows at a time, and the problem is made worse by the possibility that the data change while the read is underway.

Currently the regime is: SELECT .... FROM ... WHERE ... LIMIT ... OFFSET

I have confirmed that OFFSET simply scans down the query an appropriate number of rows; no smarts are used to make it faster. So in my test on a 2M-record table, offsetting by 0 gives a very fast result, offsetting by 1M returns in about 400ms, and offsetting by 2M in about 800ms.

Experimenting a bit with the cartodb_id, getting 500 records is very fast indeed (SELECT * FROM mytable WHERE cartodb_id BETWEEN 1500500 AND 1501000). Using the primary key only makes clear what is already implicit: we're hoping the data don't change during extraction.

Change History (11)

comment:1 by Even Rouault, 9 years ago

We probably need an ORDER BY cartodb_id to have a consistent ordering when repeating operations (although the OGR API doesn't really guarantee increasing FIDs when using GetNextFeature(), and there are certainly drivers where FIDs can arrive out of order).

Regarding your proposal, I'm wondering what happens in case there's a huge hole in cartodb_id numbering. Probably not a typical case, but could happen. What about "SELECT * FROM mytable WHERE cartodb_id >= 1500500 ORDER BY cartodb_id LIMIT 500" ?

comment:2 by pramsey, 9 years ago

hmmm... I guess you'll get sorting from the index, so the LIMIT should short-circuit and avoid requiring a full table sort. If there's a huge hole you'll just get an empty result set and have to move on to the next one. I guess your "floor and a limit" approach is no worse than any other, since the real problem is not having access to a consistent snapshot.

comment:3 by Even Rouault, 9 years ago

Perhaps when SELECT ... BETWEEN x AND x + 500 returns an empty set, a not-so-bad option would be to ask for MIN(cartodb_id) WHERE cartodb_id > x, and start the next interval at the returned id.

comment:4 by pramsey, 9 years ago

I'm not sure that Min() would necessarily result in an index scan... running some tests, though, it looks like it does work.

comment:5 by javisantana, 9 years ago

The only problem I see is when the table is not actually a CartoDBfied table (i.e. it has no cartodb_id column).

also for the case of the hole, you could use the histogram from postgres stats:


comment:6 by javisantana, 9 years ago

sorry, formatting:

# create index on testing (cartodb_id); 
SELECT 100000 
# delete from testing where cartodb_id BETWEEN 20000 and 50000; 
DELETE 30001
# create index on testing (cartodb_id); 
CREATE INDEX 
# select histogram_bounds from pg_stats where tablename = 'testing' and attname = 'cartodb_id'; 

{1,742,1499,2197,2841,3643,4332,4988,5670,6370,7092,7795,8494,9176,9862,10565,11302,12002,12714,13403,14053,14736,15452,16150,16848,17500,18215,18920,19606,50319,50992,51713,52420,53103,53784,54559,55216,55968,56656,57375,58061,58784,59508,60235,60930,61589,62272,63054,63726,64392,65049,65725,66390,67084,67760,68472,69202,69950,70607,71293,72036,72699,73388,74119,74828,75539,76219,76932,77616,78307,78989,79768,80419,81127,81761,82472,83146,83840,84535,85170,85861,86566,87287,87956,88651,89376,90101,90794,91526,92320,92983,93696,94417,95090,95804,96459,97175,97914,98622,99311,100000}

comment:7 by pramsey, 9 years ago

Could end up w/ multiple modes (gah, the horror!): a primary-key mode that scans along the primary key, and a keyless mode that continues to use the LIMIT/OFFSET approach. Since there's no guarantee of a key, I see no way to avoid retaining the LIMIT/OFFSET mode for some cases.

comment:8 by Even Rouault, 9 years ago

Milestone: 1.11.3 → 2.0
Resolution: fixed
Status: new → closed

trunk r28996 "CartoDB: in authenticated mode, retrieve all column information, including spatial info, default value and primary key in one single statement (related to #5906)"

trunk r28997 "CartoDB: use integer primary key of tables, when available, to scroll faster among features instead of using OFFSET/LIMIT (#5906)"

comment:9 by Even Rouault, 9 years ago

trunk r29000 "CartoDB: use more efficient scrolling with primary key by using 'WHERE primary_key >= next_id ORDER BY primary_key LIMIT page_size' (#5906)"

comment:10 by Even Rouault, 9 years ago

trunk r29003 "CartoDB: faster feature retrieval by not querying columns we ignore (the_geom_webmercator, created_at, updated_at) (#5906)"

Last edited 9 years ago by Even Rouault

comment:11 by Even Rouault, 9 years ago

Milestone: 2.0 → 2.0.0

Milestone renamed
