Changes between Initial Version and Version 1 of rfc54_dataset_transactions


Ignore:
Timestamp:
Feb 23, 2015, 9:18:46 AM (9 years ago)
Author:
Even Rouault
Comment:

Initial version of RFC 54

Legend:

Unmodified
Added
Removed
Modified
  • rfc54_dataset_transactions

    v1 v1  
     1= RFC 54: Dataset transactions =
     2
     3Authors: Even Rouault [[BR]]
     4Contact: even dot rouault at spatialys.com [[BR]]
     5Status: Development
     6
     7== Summary ==
     8
     9This RFC introduces an API to offer a transaction mechanism at dataset level and
     10uses it in the PostgreSQL, SQLite and GPKG drivers. It also reworks
     11significantly how transactions are handled in the PostgreSQL driver.
     12It also introduces a generic mechanism to implement an emulation of transactions
     13for datasources that would not natively support it, and uses it in the FileGDB
     14driver.
     15
     16== Rationale ==
     17
     18The current abstraction offers a transaction API at the layer level. However,
     19this is generally misleading since, when it is implemented in DBMS with
     20BEGIN/COMMIT/ROLLBACK sql statements (PostgreSQL, SQLite, GPKG, PGDump, MSSQLSpatial),
     21the semantics is really a transaction at database level that spans over all
     22layers/tables. So even if calling StartTransaction() on a layer, it also extends
     23on the changes done on other layers. In a very few drivers
     24StartTransaction()/CommitTransaction() is sometimes used as a mechanism to do
     25bulk insertion. This is for example the case of WFS, CartoDB, GFT, GME. For some
     26of them, it could rather be at dataset level too since potentially multiple layer
     27modifications could be stacked together.
     28
     29Furthermode some use cases require updating several layers consistently, hence
     30the need for a real database level transaction abstraction.
     31
     32The current situation of various drivers is the following (some of the below
     33observations resulting from the analysis are kept mainly for the benefit of developers
     34that would need to work in the drivers) :
     35
     36=== PostgreSQL ===
     37
     38A few facts about cursors used to run GetNextFeature() requests:
     39
     40  * Cursors are needed for retrieval of huge amount of data without being
     41    memory bound.
     42  * Cursors need transactions to run.
     43  * Default cursors (WITHOUT HOLD) cannot be used outside of the transaction that
     44    created tem
     45  * You cannot modify the structure of a table while the transaction (yes, the
     46    transaction, not the cursor) is still active and if you do that on another
     47    connection, it hangs until the other connection commits or rollbacks)
     48  * Within a transaction, deleted/modified rows are only visible if they are done
     49    before declaring the cursor.
     50  * Cursors WITH HOLD: may be used outside of transaction but cause a copy of
     51    the table to be done --> bad for performance
     52
     53Current flaws are :
     54  * one cannot do interleaved layer reading (beyond the
     55    first 500 features fetched, can be easily seen with OGR_PG_CURSOR_PAGE=1) due to
     56    the underlying implicit transaction created to read layer A being closed when the
     57    reading of layer B starts.
     58  * GetFeature() flushes the current transaction and starts a new one to do a
     59    cursor SELECT. Which is unnecessary since we retrieve only one record
     60  * SetAttributeFilter() issues a ResetReading() which currently
     61    FlushSoftTransaction() the ongoing transaction. Can be annoying in a scenario
     62    with long update where you need transactional guarantee
     63
     64What works :
     65  * Transaction support at the layer level forwarded to datasource.
     66  * Interleaved writing works (even with copy mode)
     67
     68=== SQLite/GPKG ===
     69
     70  * Mechanisms used to read table content (sqlite3_prepare() / sqlite3_step())
     71    do not need transactions.
     72  * Step sees structure modifications (e.g. column addition) if run after
     73    prepared statement but before first step.
     74  * Step sees row modifications/additions as soon as they occur.
     75  * Transaction support at the layer level forwarded to datasource.
     76
     77=== MySQL ===
     78
     79  * Cannot do interleaved layer reading (reading in one layer resets the other
     80    reading) because of the use of mysql_use_result() that can work with one single
     81    request at a time. mysql_store_result() would be a solution but requires
     82    ingesting the whole result set into memory, which is inpractical for big layers.
     83  * step does not set row changes once the query has started (if done through another
     84    connection, because if done through ExecuteSQL() the long transaction is interrupted)
     85  * No transaction support
     86
     87=== OCI ===
     88
     89  * Interleaved layer reading works
     90  * Changes done after SELECT seem not to be seen.
     91  * No transaction support
     92
     93=== FileGDB ===
     94
     95  * Interleaved layer reading works
     96  * Changes done after SELECT seem not to be seen.
     97  * No transaction support
     98
     99
     100== Proposed changes ==
     101
     102=== GDALDataset changes ===
     103
     104The following methods are added to GDALDataset (and usable by OGRDataSource
     105which inherits from GDALDataset).
     106
     107{{{
     108/************************************************************************/
     109/*                           StartTransaction()                         */
     110/************************************************************************/
     111
     112/**
     113 \brief For datasources which support transactions, StartTransaction creates a transaction.
     114
     115 If starting the transaction fails, will return
     116 OGRERR_FAILURE. Datasources which do not support transactions will
     117 always return OGRERR_UNSUPPORTED_OPERATION.
     118
     119 Nested transactions are not supported.
     120 
     121 All changes done after the start of the transaction are definitely applied in the
     122 datasource if CommitTransaction() is called. They may be cancelled by calling
     123 RollbackTransaction() instead.
     124 
     125 At the time of writing, transactions only apply on vector layers.
     126 
     127 Datasets that support transactions will advertize the ODsCTransactions capability.
     128 Use of transactions at dataset level is generally prefered to transactions at
     129 layer level, whose scope is rarely limited to the layer from which it was started.
     130 
     131 In case StartTransaction() fails, neither CommitTransaction() or RollbackTransaction()
     132 should be called.
     133 
     134 If an error occurs after a successfull StartTransaction(), the whole
     135 transaction may or may not be implicitely cancelled, depending on drivers. (e.g.
     136 the PG driver will cancel it, SQLite/GPKG not). In any case, in the event of an
     137 error, an explicit call to RollbackTransaction() should be done to keep things balanced.
     138 
     139 By default, when bForce is set to FALSE, only "efficient" transactions will be
     140 attempted. Some drivers may offer an emulation of transactions, but sometimes
     141 with significant overhead, in which case the user must explicitely allow for such
     142 an emulation by setting bForce to TRUE. Drivers that offer emulated transactions
     143 should advertize the ODsCEmulatedTransactions capability (and not ODsCTransactions).
     144 
     145 This function is the same as the C function GDALDatasetStartTransaction().
     146
     147 @param bForce can be set to TRUE if an emulation, possibly slow, of a transaction
     148               mechanism is acceptable.
     149
     150 @return OGRERR_NONE on success.
     151 @since GDAL 2.0
     152*/
     153OGRErr GDALDataset::StartTransaction(CPL_UNUSED int bForce);
     154
     155
     156/************************************************************************/
     157/*                           CommitTransaction()                        */
     158/************************************************************************/
     159
     160/**
     161 \brief For datasources which support transactions, CommitTransaction commits a transaction.
     162
     163 If no transaction is active, or the commit fails, will return
     164 OGRERR_FAILURE. Datasources which do not support transactions will
     165 always return OGRERR_UNSUPPORTED_OPERATION.
     166 
     167 Depending on drivers, this may or may not abort layer sequential readings that
     168 are active.
     169
     170 This function is the same as the C function GDALDatasetCommitTransaction().
     171
     172 @return OGRERR_NONE on success.
     173 @since GDAL 2.0
     174*/
     175OGRErr GDALDataset::CommitTransaction();
     176
     177/************************************************************************/
     178/*                           RollbackTransaction()                      */
     179/************************************************************************/
     180
     181/**
     182 \brief For datasources which support transactions, RollbackTransaction will roll
     183 back a datasource to its state before the start of the current transaction.
     184
     185 If no transaction is active, or the rollback fails, will return 
     186 OGRERR_FAILURE. Datasources which do not support transactions will
     187 always return OGRERR_UNSUPPORTED_OPERATION.
     188
     189 This function is the same as the C function GDALDatasetRollbackTransaction().
     190
     191 @return OGRERR_NONE on success.
     192 @since GDAL 2.0
     193*/
     194OGRErr GDALDataset::RollbackTransaction();
     195}}}
     196
     197Note: in the GDALDataset class itself, those methods have an empty implementation
     198that returns OGRERR_UNSUPPORTED_OPERATION.
     199
     200Those 3 methods are mapped at the C level as :
     201{{{
     202OGRErr CPL_DLL GDALDatasetStartTransaction(GDALDatasetH hDS, int bForce);
     203OGRErr CPL_DLL GDALDatasetCommitTransaction(GDALDatasetH hDS);
     204OGRErr CPL_DLL GDALDatasetRollbackTransaction(GDALDatasetH hDS);
     205}}}
     206
     207Two news dataset capabilities are added :
     208  * ODsCTransactions: True if this datasource supports (efficient) transactions.
     209  * ODsCEmulatedTransactions: True if this datasource supports transactions through emulation.
     210
     211=== Emulated transactions ===
     212
     213A new function OGRCreateEmulatedTransactionDataSourceWrapper() is added for used
     214by drivers that do not natively support transactions but want an emulation of
     215them. It could potentially be adopted by any datasource whose data is supportted
     216by files/directories.
     217
     218{{{
     219/** Returns a new datasource object that adds transactional behaviour to an existing datasource.
     220 *
     221 * The provided poTransactionBehaviour object should implement driver-specific
     222 * behaviour for transactions.
     223 *
     224 * The generic mechanisms offered by the wrapper class do not cover concurrent
     225 * updates (though different datasource connections) to the same datasource files.
     226 *
     227 * There are restrictions on what can be accomplished. For example it is not
     228 * allowed to have a unreleased layer returned by ExecuteSQL() before calling
     229 * StartTransaction(), CommitTransaction() or RollbackTransaction().
     230 *
     231 * Layer structural changes are not allowed after StartTransaction() if the
     232 * layer definition object has been returned previously with GetLayerDefn().
     233 *
     234 * @param poBaseDataSource the datasource to which to add transactional behaviour.
     235 * @param poTransactionBehaviour an implementation of the IOGRTransactionBehaviour interface.
     236 * @param bTakeOwnershipDataSource whether the returned object should own the
     237 *                                 passed poBaseDataSource (and thus destroy it
     238 *                                 when it is destroyed itself).
     239 * @param bTakeOwnershipTransactionBehaviour whether the returned object should own
     240 *                                           the passed poTransactionBehaviour
     241 *                                           (and thus destroy it when
     242 *                                           it is destroyed itself).
     243 * @return a new datasource handle
     244 * @since GDAL 2.0
     245 */
     246OGRDataSource CPL_DLL* OGRCreateEmulatedTransactionDataSourceWrapper(
     247                                OGRDataSource* poBaseDataSource,
     248                                IOGRTransactionBehaviour* poTransactionBehaviour,
     249                                int bTakeOwnershipDataSource,
     250                                int bTakeOwnershipTransactionBehaviour);
     251}}}
     252
     253The definition of the IOGRTransactionBehaviour interface is the following:
     254{{{
     255/** IOGRTransactionBehaviour is an interface that a driver must implement
     256 *  to provide emulation of transactions.
     257 *
     258 * @since GDAL 2.0
     259 */
     260class CPL_DLL IOGRTransactionBehaviour
     261{
     262    public:
     263
     264        /** Start a transaction.
     265        *
     266        * The implementation may update the poDSInOut reference by closing
     267        * and reopening the datasource (or assigning it to NULL in case of error).
     268        * In which case bOutHasReopenedDS must be set to TRUE.
     269        *
     270        * The implementation can for example backup the existing files/directories
     271        * that compose the current datasource.
     272        *
     273        * @param poDSInOut datasource handle that may be modified
     274        * @param bOutHasReopenedDS output boolean to indicate if datasource has been closed
     275        * @return OGRERR_NONE in case of success
     276        */
     277       virtual OGRErr StartTransaction(OGRDataSource*& poDSInOut,
     278                                       int& bOutHasReopenedDS) = 0;
     279
     280        /** Commit a transaction.
     281        *
     282        * The implementation may update the poDSInOut reference by closing
     283        * and reopening the datasource (or assigning it to NULL in case of error).
     284        * In which case bOutHasReopenedDS must be set to TRUE.
     285        *
     286        * The implementation can for example remove the backup it may have done
     287        * at StartTransaction() time.
     288        *
     289        * @param poDSInOut datasource handle that may be modified
     290        * @param bOutHasReopenedDS output boolean to indicate if datasource has been closed
     291        * @return OGRERR_NONE in case of success
     292        */
     293       virtual OGRErr CommitTransaction(OGRDataSource*& poDSInOut,
     294                                        int& bOutHasReopenedDS) = 0;
     295
     296        /** Rollback a transaction.
     297        *
     298        * The implementation may update the poDSInOut reference by closing
     299        * and reopening the datasource (or assigning it to NULL in case of error).
     300        * In which case bOutHasReopenedDS must be set to TRUE.
     301        *
     302        * The implementation can for example restore the backup it may have done
     303        * at StartTransaction() time.
     304        *
     305        * @param poDSInOut datasource handle that may be modified
     306        * @param bOutHasReopenedDS output boolean to indicate if datasource has been closed
     307        * @return OGRERR_NONE in case of success
     308        */
     309       virtual OGRErr RollbackTransaction(OGRDataSource*& poDSInOut,
     310                                          int& bOutHasReopenedDS) = 0;
     311
     312        /** Ask a layer to re-"adopt" its previous feature definition object.
     313        *
     314        * The aim is that the user doesn't see visible object changes.
     315        *
     316        * This method is called if StartTransaction(), CommitTransaction() or
     317        * RollbackTransaction() has closed and reopened a datasource.
     318        *
     319        * The implementation os the method should drop the reference to feature
     320        * definition object it may have currently, assign the poFeatureDefn as
     321        * its current definition object, and take a new reference on it.
     322        *
     323        * The passed poFeatureDefn object is supposed to be the "same" as
     324        * the one the layer has currently (as verified with IsSame()) since
     325        * the emulated datasource wrapper prevents layer structural modifications
     326        * in cases where the user has already called GetLayerDefn()
     327        */
     328        virtual void   ReadoptOldFeatureDefn(OGRDataSource* poDS,
     329                                            OGRLayer* poLayer,
     330                                            OGRFeatureDefn* poFeatureDefn) = 0;   
     331};
     332}}}
     333
     334=== OPGRLayer changes ===
     335
     336At the OGRLayer level, the documentation of GetNextFeature() receives the
     337following additional information to clarify its semantics :
     338
     339{{{
     340Features returned by GetNextFeature() may or may not be affected by concurrent
     341modifications depending on drivers. A guaranteed way of seing modifications in
     342effect is to call ResetReading() on layers where GetNextFeature() has been called,
     343before reading again. Structural changes in layers (field addition, deletion, ...)
     344when a read is in progress may or may not be possible depending on drivers.
     345If a transaction is committed/aborted, the current sequential reading may or may
     346not be valid after that operation and a call to ResetReading() might be needed.
     347}}}
     348
     349=== PG driver changes ===
     350
     351Dataset level transactions have been implemented, and use of implicitely
     352created transactions reworked.
     353
     354Interleaved layer reading is now possible.
     355
     356GetFeature() has been modified to run without a cursor or a transaction, and
     357all other calls to transactions have been checked/modified to not reset
     358accidentally a transaction initiated by the user.
     359
     360Below the new behaviour as described in the updated drv_pg_advanced.html help
     361page :
     362
     363{{{
     364Efficient sequential reading in PostgreSQL requires to be done within a transaction
     365(technically this is a CURSOR WITHOUT HOLD).
     366So the PG driver will implicitely open such a transaction if none is currently
     367opened as soon as a feature is retrieved. This transaction will be released if
     368ResetReading() is called (provided that no other layer is still being read).
     369
     370If within such an implicit transaction, an explicit dataset level StartTransaction()
     371is issued, the PG driver will use a SAVEPOINT to emulate properly the transaction
     372behaviour while making the active cursor on the read layer still opened.
     373
     374If an explicit transaction is opened with dataset level StartTransaction()
     375before reading a layer, this transaction will be used for the cursor that iterates
     376over the layer. When explicitly committing or rolling back the transaction, the
     377cursor will become invalid, and ResetReading() should be issued again to restart
     378reading from the beginning.
     379
     380With the above rules, the below examples show the SQL instructions that are
     381run when using the OGR API in different scenarios.
     382
     383
     384lyr1->GetNextFeature()             BEGIN (implict)
     385                                   DECLARE cur1 CURSOR FOR SELECT * FROM lyr1
     386                                   FETCH 1 IN cur1
     387
     388lyr2->GetNextFeature()             DECLARE cur2 CURSOR  FOR SELECT * FROM lyr2
     389                                   FETCH 1 IN cur2
     390
     391lyr1->GetNextFeature()             FETCH 1 IN cur1
     392
     393lyr2->GetNextFeature()             FETCH 1 IN cur2
     394
     395lyr1->CreateFeature(f)             INSERT INTO cur1 ...
     396
     397lyr1->ResetReading()               CLOSE cur1
     398
     399lyr2->ResetReading()               CLOSE cur2
     400                                   COMMIT (implicit)
     401
     402
     403
     404ds->StartTransaction()             BEGIN
     405
     406lyr1->GetNextFeature()             DECLARE cur1 CURSOR FOR SELECT * FROM lyr1
     407                                   FETCH 1 IN cur1
     408
     409lyr2->GetNextFeature()             DECLARE cur2 CURSOR FOR SELECT * FROM lyr2
     410                                   FETCH 1 IN cur2
     411
     412lyr1->CreateFeature(f)             INSERT INTO cur1 ...
     413
     414lyr1->ResetReading()               CLOSE cur1
     415
     416lyr2->ResetReading()               CLOSE cur2
     417
     418ds->CommitTransaction()            COMMIT
     419
     420
     421
     422ds->StartTransaction()             BEGIN
     423
     424lyr1->GetNextFeature()             DECLARE cur1 CURSOR FOR SELECT * FROM lyr1
     425                                   FETCH 1 IN cur1
     426
     427lyr1->CreateFeature(f)             INSERT INTO cur1 ...
     428
     429ds->CommitTransaction()            CLOSE cur1 (implicit)
     430                                   COMMIT
     431
     432lyr1->GetNextFeature()             FETCH 1 IN cur1      ==> Error since the cursor was closed with the commit. Explicit ResetReading() required before
     433
     434
     435
     436lyr1->GetNextFeature()             BEGIN (implicit)
     437                                   DECLARE cur1 CURSOR FOR SELECT * FROM lyr1
     438                                   FETCH 1 IN cur1
     439
     440ds->StartTransaction()             SAVEPOINT savepoint
     441
     442lyr1->CreateFeature(f)             INSERT INTO cur1 ...
     443
     444ds->CommitTransaction()            RELEASE SAVEPOINT savepoint
     445
     446lyr1->ResetReading()               CLOSE cur1
     447                                   COMMIT (implicit)
     448
     449
     450Note: in reality, the PG drivers fetches 500 features at once. The FETCH 1
     451is for clarity of the explanation.
     452}}}
     453
     454It is recommended to do operations within explicit transactions for ease of mind
     455(some troubles fixing ogr_pg.py, but which does admitedly weird things like
     456reopening connections, which does not fly very well with 'implicit' transactions)
     457
     458=== GPKG and SQLite driver changes ===
     459
     460Dataset level transactions have been implemented. A few fixes made here and there
     461to avoid reseting accidentally a transaction initiated by the user.
     462
     463=== FileGDB driver changes ===
     464
     465The FileGDB driver uses the above described emulation to offer a transaction
     466mechanism. This works by backing up the
     467current state of a geodatabase when StartTransaction(force=TRUE) is called.
     468If the transaction is committed, the backup copy is destroyed. If the transaction
     469is rolled back, the backup copy is restored. So this might be costly when operating
     470on huge geodatabases.
     471Note that this emulation has an unspecified behaviour in case of concurrent updates
     472(with different connections in the same or another process).
     473
     474== SWIG bindings (Python / Java / C# / Perl) changes ==
     475
     476The following additions have been done :
     477  * Dataset.StartTransaction(int force=FALSE)
     478  * Dataset.CommitTransaction()
     479  * Dataset.RollbackTransaction()
     480  * ogr.ODsCTransactions constant
     481  * ogr.ODsCEmulatedTransactions constant
     482
     483== Utilities ==
     484
     485ogr2ogr now uses dataset transactions (instead of layer transactions) if
     486ODsCTransactions is advertized.
     487
     488== Documentation ==
     489
     490New/modified API are documented. MIGRATION_GUIDE.TXT updated with mention
     491to below compatibility issues.
     492
     493== Test Suite ==
     494
     495The test suite is extended to test
     496 * updated drivers: PG, GPKG, SQLite, FileGDB
     497 * use of database transactions by ogr2ogr
     498
     499== Compatibility Issues ==
     500
     501As described above, subtle behaviour changes can be observed with the PG driver,
     502related to implicit transactions that were flushed before and are no longer now,
     503but this should hopefully be restricted to non-typical use cases. So some cases that "worked" before might
     504no longer work, but the new behaviour should hopefully be more understandable.
     505
     506The PG and SQLite drivers could accept apparently nested calls to StartTransaction()
     507(at the layer level). This is no longer possible since they are now redirected
     508to dataset transactions, that explicitely do not support it.
     509
     510== Out of scope ==
     511
     512The following drivers that implement BEGIN/COMMIT/ROLLBACK could be later
     513enhanced to support dataset transactions: OCI, MySQL, MSSQLSpatial.
     514
     515GFT, CartoDB, WFS could also benefit for dataset transactions.
     516
     517VRT currently supports layer transactions (if the underlying dataset support it,
     518and excluding union layers). If dataset transaction were to be implemented, should
     519it consist in forwarding dataset transaction to source dataset(s) ? Implementation
     520might be complicated in case the same dataset is used by multiple sources,
     521but more fundamentally one cannot guarantee ACID on multiple datasets.
     522
     523== Related tickets ==
     524
     525A proposed revision on how transactions are implemented in the PG driver was
     526proposed a long time ago (https://trac.osgeo.org/gdal/ticket/1265) to solve
     527some of the above issues. The patch no longer applies but it is expected that
     528the changes done for this RFC cover the issues that the ticket wanted to address.
     529
     530== Implementation ==
     531
     532Implementation will be done by Even Rouault ([http://spatialys.com Spatialys]),
     533and sponsored by [http://www.linz.govt.nz/ LINZ (Land Information New Zealand)].
     534
     535The proposed implementation lies in the "rfc54_dataset_transactions" branch of the
     536https://github.com/rouault/gdal2/tree/rfc54_dataset_transactions repository.
     537
     538The list of changes: https://github.com/rouault/gdal2/compare/rfc54_dataset_transactions
     539
     540== Voting history ==
     541
     542TBD