| 1 | = RFC 66 : OGR random layer read/write capabilities = |
| 2 | |
| 3 | Author: Even Rouault[[BR]] |
| 4 | |
| 5 | Contact: even.rouault at spatialys.com[[BR]] |
| 6 | |
| 7 | Status: Development[[BR]] |
| 8 | |
| 9 | Target version: 2.2 |
| 10 | |
| 11 | == Summary == |
| 12 | |
| 13 | This RFC introduces a new API to iterate over vector features at the |
| 14 | dataset level, in addition to the existing capability of doing so at the |
| 15 | layer level. |
| 16 | The existing capability of writing features to layers in random order, which is |
| 17 | supported by most drivers with output capabilities, is formalized with a new |
| 18 | dataset capability flag. |
| 19 | |
| 20 | == Rationale == |
| 21 | |
| 22 | Some vector formats mix features that belong to different layers in an |
| 23 | interleaved way, which makes the current per-layer feature iteration rather |
| 24 | inefficient (each layer requires reading the whole file). |
| 25 | One example of such drivers is the OSM driver. For this driver, a hack was |
| 26 | developed in the past to be able to use the OGRLayer::GetNextFeature() |
| 27 | method, but with very particular semantics. See the "Interleaved reading" |
| 28 | paragraph of http://gdal.org/drv_osm.html for more details. A similar need |
| 29 | arises with the development of a new driver, GMLAS (for GML Application Schemas), |
| 30 | that reads GML files with arbitrary element nesting, and thus can return features |
| 31 | in an apparently random order, because it works in a streaming way. |
| 32 | For example, let's consider the following simplified XML content : |
| 33 | {{{ |
| 34 | <A> |
| 35 | ... |
| 36 | <B> |
| 37 | ... |
| 38 | </B> |
| 39 | ... |
| 40 | </A> |
| 41 | }}} |
| 42 | The driver finishes building feature B before it can emit feature A, since |
| 43 | element B closes before element A. So when reading sequences of this pattern, |
| 44 | the driver will emit features in the order B,A,B,A,... |
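The emission order described above can be illustrated with a minimal, self-contained sketch. It is not the GMLAS driver's code: `XmlEvent` and `EmissionOrder` are hypothetical names standing in for SAX-style start/end callbacks, and the only assumption is that a feature can be emitted once its closing tag has been read.

```cpp
#include <cassert>
#include <stack>
#include <string>
#include <vector>

// Hypothetical event type standing in for SAX-style parser callbacks.
struct XmlEvent {
    bool isStart;      // true for <X>, false for </X>
    std::string name;  // element name
};

// Returns element names in the order a streaming reader could emit them
// as features: an element is only complete once its end tag is seen.
std::vector<std::string> EmissionOrder(const std::vector<XmlEvent>& events)
{
    std::stack<std::string> open;  // elements started but not yet closed
    std::vector<std::string> emitted;
    for (const XmlEvent& ev : events) {
        if (ev.isStart) {
            open.push(ev.name);
        } else {
            // This sketch assumes well-formed input.
            assert(!open.empty() && open.top() == ev.name);
            open.pop();
            emitted.push_back(ev.name);  // feature is complete: emit it
        }
    }
    return emitted;
}
```

Feeding it the event sequence for two consecutive `<A>...<B>...</B>...</A>` blocks yields B, A, B, A, which matches the interleaving a per-layer iteration cannot follow efficiently.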
| 45 | |
| 46 | == Changes == |
| 47 | |
| 48 | === C++ API === |
| 49 | |
| 50 | Two new methods are added at the GDALDataset level : |
| 51 | |
| 52 | GetNextFeature(): |
| 53 | |
| 54 | {{{ |
| 55 | /** |
| 56 | \brief Fetch the next available feature from this dataset. |
| 57 | |
| 58 | The returned feature becomes the responsibility of the caller to |
| 59 | delete with OGRFeature::DestroyFeature(). |
| 60 | |
| 61 | Depending on the driver, this method may return features from layers in a |
| 62 | non-sequential way. This is what may happen when the |
| 63 | ODsCRandomLayerRead capability is declared (for example for the |
| 64 | OSM and GMLAS drivers). When datasets declare this capability, it is strongly |
| 65 | advised to use GDALDataset::GetNextFeature() instead of |
| 66 | OGRLayer::GetNextFeature(), as the latter might have a slow, incomplete or stub |
| 67 | implementation. |
| 68 | |
| 69 | The default implementation, used by most drivers, will |
| 70 | however iterate over each layer, and then over each feature within this |
| 71 | layer. |
| 72 | |
| 73 | This method takes into account spatial and attribute filters set on layers that |
| 74 | will be iterated upon. |
| 75 | |
| 76 | The ResetReading() method can be used to start at the beginning again. |
| 77 | |
| 78 | Depending on drivers, this may also have the side effect of calling |
| 79 | OGRLayer::GetNextFeature() on the layers of this dataset. |
| 80 | |
| 81 | This method is the same as the C function GDALDatasetGetNextFeature(). |
| 82 | |
| 83 | @param ppoBelongingLayer a pointer to an OGRLayer* variable to receive the |
| 84 | layer to which the feature belongs, or NULL. |
| 85 | It is possible for *ppoBelongingLayer to be NULL |
| 86 | on output even if the returned feature is not NULL. |
| 87 | @param pdfProgressPct a pointer to a double variable to receive the |
| 88 | percentage progress (in [0,1] range), or NULL. |
| 89 | On return, the pointed value might be negative if |
| 90 | determining the progress is not possible. |
| 91 | @param pfnProgress a progress callback to report progress (for |
| 92 | GetNextFeature() calls that might have a long duration) |
| 93 | and offer cancellation possibility, or NULL |
| 94 | @param pProgressData user data provided to pfnProgress, or NULL |
| 95 | @return a feature, or NULL if no more features are available. |
| 96 | @since GDAL 2.2 |
| 97 | */ |
| 98 | |
| 99 | OGRFeature* GDALDataset::GetNextFeature( OGRLayer** ppoBelongingLayer, |
| 100 | double* pdfProgressPct, |
| 101 | GDALProgressFunc pfnProgress, |
| 102 | void* pProgressData ); |
| 103 | }}} |
| 104 | |
| 105 | and ResetReading(): |
| 106 | |
| 107 | {{{ |
| 108 | /** |
| 109 | \brief Reset feature reading to start on the first feature. |
| 110 | |
| 111 | This affects GetNextFeature(). |
| 112 | |
| 113 | Depending on drivers, this may also have the side effect of calling |
| 114 | OGRLayer::ResetReading() on the layers of this dataset. |
| 115 | |
| 116 | This method is the same as the C function GDALDatasetResetReading(). |
| 117 | |
| 118 | @since GDAL 2.2 |
| 119 | */ |
| 120 | void GDALDataset::ResetReading(); |
| 121 | }}} |
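The default behaviour described in the documentation above (iterate over each layer, then over each feature within that layer) can be sketched with stand-in types. This is only an illustration under stated assumptions, not GDAL's implementation: `MockFeature`, `MockLayer`, `MockDataset` and `ReadAll` are hypothetical, the progress parameters are omitted, and the mock returns borrowed pointers whereas the real method returns a feature the caller must destroy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for OGRFeature / OGRLayer / GDALDataset.
struct MockFeature { std::string id; };

struct MockLayer {
    std::string name;
    std::vector<MockFeature> features;
    std::size_t next = 0;  // per-layer read cursor

    void ResetReading() { next = 0; }
    const MockFeature* GetNextFeature() {
        return next < features.size() ? &features[next++] : nullptr;
    }
};

struct MockDataset {
    std::vector<MockLayer> layers;
    std::size_t currentLayer = 0;  // dataset-level cursor

    // Restart dataset-level iteration; in this sketch it also resets layers.
    void ResetReading() {
        currentLayer = 0;
        for (MockLayer& layer : layers) layer.ResetReading();
    }

    // Exhaust layer 0, then layer 1, and so on. The owning layer is
    // reported through ppoBelongingLayer when that pointer is non-null.
    const MockFeature* GetNextFeature(MockLayer** ppoBelongingLayer) {
        while (currentLayer < layers.size()) {
            MockLayer& layer = layers[currentLayer];
            if (const MockFeature* f = layer.GetNextFeature()) {
                if (ppoBelongingLayer) *ppoBelongingLayer = &layer;
                return f;
            }
            ++currentLayer;  // this layer is exhausted, move to the next
        }
        if (ppoBelongingLayer) *ppoBelongingLayer = nullptr;
        return nullptr;
    }
};

// Collects "layer:feature" strings in dataset iteration order.
std::vector<std::string> ReadAll(MockDataset& ds)
{
    std::vector<std::string> out;
    MockLayer* layer = nullptr;
    while (const MockFeature* f = ds.GetNextFeature(&layer)) {
        assert(layer != nullptr);  // the mock always reports a layer
        out.push_back(layer->name + ":" + f->id);
    }
    return out;
}
```

Once the mock dataset is exhausted, further calls return a null pointer until `ResetReading()` is called, mirroring the documented role of the two methods.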
| 122 | |
| 123 | === New capabilities === |
| 124 | |
| 125 | The following 2 new dataset capabilities are added : |
| 126 | {{{ |
| 127 | #define ODsCRandomLayerRead "RandomLayerRead" /**< Dataset capability for GetNextFeature() returning features from random layers */ |
| 128 | #define ODsCRandomLayerWrite "RandomLayerWrite " /**< Dataset capability for supporting CreateFeature on layer in random order */ |
| 129 | }}} |
| 130 | |
| 131 | === C API === |
| 132 | |
| 133 | The above 2 new methods are available in the C API with : |
| 134 | {{{ |
| 135 | OGRFeatureH CPL_DLL GDALDatasetGetNextFeature( GDALDatasetH hDS, |
| 136 | OGRLayerH* phBelongingLayer, |
| 137 | double* pdfProgressPct, |
| 138 | GDALProgressFunc pfnProgress, |
| 139 | void* pProgressData ); |
| 140 | |
| 141 | void CPL_DLL GDALDatasetResetReading( GDALDatasetH hDS ); |
| 142 | }}} |
| 143 | |
| 144 | == Discussion about a few design choices of the new API == |
| 145 | |
| 146 | Compared to OGRLayer::GetNextFeature(), GDALDataset::GetNextFeature() has a |
| 147 | few differences : |
| 148 | - it returns the layer which the feature belongs to. Indeed, there's no easy way |
| 149 | from a feature to know which layer it belongs to (since in the data model, |
| 150 | features can exist outside of any layer). One possibility would be to |
| 151 | correlate the OGRFeatureDefn* object of the feature with the one of the layer, |
| 152 | but that is a bit inconvenient to do (and theoretically, one could imagine |
| 153 | several layers sharing the same feature definition object, although this |
| 154 | probably never happens in any in-tree driver). |
| 155 | - even if the feature returned is not NULL, the returned layer might be NULL. |
| 156 | This is just a provision for now, since that cannot currently happen. This |
| 157 | could be interesting to address schema-less datasources where basically each |
| 158 | feature could have a different schema (GeoJSON for example) without really |
| 159 | belonging to a clearly identified layer. |
| 160 | - it returns a progress percentage. When using the OGRLayer API, one has to compare |
| 161 | the number of features returned so far with the total number returned by GetFeatureCount(). |
| 162 | For the use cases we want to address, quickly knowing the total number of features |
| 163 | of the dataset is not doable. But knowing the position of the file pointer |
| 164 | relative to the total size of the file is easy. Hence the decision to make |
| 165 | GetNextFeature() return the progress percentage. Regarding the choice of the |
| 166 | range [0,1], this is to be consistent with the range accepted by GDAL progress |
| 167 | functions. |
| 168 | - it accepts a progress and cancellation callback. One could wonder why this is |
| 169 | needed given that GetNextFeature() is an "elementary" method and that it |
| 170 | can already return the progress percentage. However, in some circumstances, |
| 171 | it might take a rather long time to complete a GetNextFeature() call. For |
| 172 | example in the case of the OSM driver, as an optimization you can ask the |
| 173 | driver to return features of a subset of layers, for example all layers except |
| 174 | nodes. But generally the nodes are at the beginning of the file, so before you |
| 175 | get the first feature, you typically have to process 70% of the whole file. In |
| 176 | the GMLAS driver, the first GetNextFeature() call is also the opportunity to |
| 177 | do a preliminary quick scan of the file to determine the SRS of geometry columns, |
| 178 | hence having progress feedback is welcome. |
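As an illustration of the progress point in the list above, a driver could derive the value from the file pointer position. The helper below is a hypothetical sketch (`ProgressFromOffset` is not a GDAL function); it assumes that a total size of zero means the size is unknown, in which case it returns a negative value, as the documentation of pdfProgressPct allows.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: progress as a fraction in the [0,1] range, computed
// from the current read offset and the total file size. Returns a negative
// value when the total size is unknown (represented here as 0).
double ProgressFromOffset(std::uint64_t offset, std::uint64_t totalSize)
{
    if (totalSize == 0) return -1.0;      // progress cannot be determined
    if (offset >= totalSize) return 1.0;  // clamp at end of file
    return static_cast<double>(offset) / static_cast<double>(totalSize);
}
```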
| 179 | |
| 180 | The progress percentage output is redundant with the progress callback mechanism, |
| 181 | and the latter could be used to get the former, but doing so is a bit convoluted. |
| 182 | It would require doing things like: |
| 183 | |
| 184 | {{{ |
| 185 | int MyProgress(double pct, const char* msg, void* user_data) |
| 186 | { |
| 187 | *(double*)user_data = pct; |
| 188 | return TRUE; |
| 189 | } |
| 190 | |
| 191 | myDS->GetNextFeature(&poLayer, NULL, MyProgress, &pct); |
| 192 | }}} |
| 193 | |
| 194 | |
| 195 | == SWIG bindings (Python / Java / C# / Perl) changes == |
| 196 | |
| 197 | GDALDatasetGetNextFeature is mapped as gdal::Dataset::GetNextFeature() and |
| 198 | GDALDatasetResetReading as gdal::Dataset::ResetReading(). |
| 199 | |
| 200 | Regarding gdal::Dataset::GetNextFeature(), currently only Python has been modified |
| 201 | to return both the feature and the layer it belongs to. Other bindings just return |
| 202 | the feature for now (they would need specialized typemaps). |
| 203 | |
| 204 | == Drivers == |
| 205 | |
| 206 | The OSM and GMLAS drivers are updated to implement the new API. |
| 207 | |
| 208 | Existing drivers that support ODsCRandomLayerWrite are updated to advertise it |
| 209 | (that is most drivers that have layer creation capabilities, with the exceptions |
| 210 | of KML, JML and GeoJSON). |
| 211 | |
| 212 | == Utilities == |
| 213 | |
| 214 | ogr2ogr / GDALVectorTranslate() is changed internally to remove the hack that |
| 215 | was used for the OSM driver, and to use the new API when ODsCRandomLayerRead is |
| 216 | advertised. It checks if the output driver advertises ODsCRandomLayerWrite, and |
| 217 | if it does not, emits a warning, but still proceeds with the conversion |
| 218 | using random layer reading/writing. |
| 219 | |
| 220 | ogrinfo is extended to accept a -rl (for random layer) flag that instructs it |
| 221 | to use the GDALDataset::GetNextFeature() API. It was considered to use it |
| 222 | automatically when ODsCRandomLayerRead was advertised, but the output can be |
| 223 | quite... random and thus not very practical for the user. |
| 224 | |
| 225 | == Documentation == |
| 226 | |
| 227 | All new methods/functions are documented. |
| 228 | |
| 229 | == Test Suite == |
| 230 | |
| 231 | The specialized GetNextFeature() implementations of the OSM and GMLAS drivers |
| 232 | are tested in their respective tests. The default implementation of |
| 233 | GDALDataset::GetNextFeature() is tested in the MEM driver tests. |
| 234 | |
| 235 | == Compatibility Issues == |
| 236 | |
| 237 | None for existing users of the C/C++ API. |
| 238 | |
| 239 | Since there is a default implementation, the new functions/methods can be safely |
| 240 | used on drivers that don't have a specialized implementation. |
| 241 | |
| 242 | The addition of the new virtual methods GDALDataset::ResetReading() and |
| 243 | GDALDataset::GetNextFeature() may cause issues for out-of-tree drivers that |
| 244 | already use such method names internally, but with different semantics |
| 245 | or signatures. We have encountered such issues with a few in-tree drivers, and |
| 246 | fixed them. |
| 247 | |
| 248 | == Implementation == |
| 249 | |
| 250 | The implementation will be done by Even Rouault, and is mostly triggered by |
| 251 | the needs of the new GMLAS driver (initial development funded by the European Earth |
| 252 | observation programme Copernicus). |
| 253 | |
| 254 | The proposed implementation is in https://github.com/rouault/gdal2/tree/gmlas_randomreadwrite |
| 255 | (commit: https://github.com/rouault/gdal2/commit/8447606d68b9fac571aa4d381181ecfffed6d72c) |
| 256 | |
| 257 | == Voting history == |
| 258 | |
| 259 | TBD |