Opened 8 years ago

Closed 2 years ago

#3224 closed defect (fixed)

Problems with large raster sizes (WMS/TMS)

Reported by: tamas Owned by: warmerdam
Priority: normal Milestone: 2.1.0
Component: GDAL_Raster Version: unspecified
Severity: normal Keywords:
Cc: EliL

Description

I've been testing with the WMS/TMS driver at large zoom levels according to the following WMS XML file:

<GDAL_WMS>
  <Service name="TMS">
    <ServerUrl>http://web0.nearmap.com/maps/hl=en&x=${x}&y=${y}&z=${z}&nml=Vert</ServerUrl>
  </Service>
  <DataWindow>
    <UpperLeftX>-20037508.34</UpperLeftX>
    <UpperLeftY>20037508.34</UpperLeftY>
    <LowerRightX>20037508.34</LowerRightX>
    <LowerRightY>-20037508.34</LowerRightY>
    <TileLevel>20</TileLevel>
    <TileCountX>1</TileCountX>
    <TileCountY>1</TileCountY>
    <YOrigin>top</YOrigin>
  </DataWindow>
  <Projection>EPSG:900913</Projection>
  <BlockSizeX>256</BlockSizeX>
  <BlockSizeY>256</BlockSizeY>
  <BandsCount>3</BandsCount>
  <Cache />
</GDAL_WMS>

1. With TileLevel set to 19 everything works fine. However, when I increase TileLevel to 20 I always get the following error: "Out of memory in InitBlockInfo()". This is because GDAL tries to pre-allocate the pointers for all tiles in the internal memory cache. This can be handled by replacing the matrix-based common block cache implementation with a hashtable-based implementation.

2. When I set TileLevel to 23 I get: 'Invalid dataset dimensions : -2147483648 x -2147483648'. In this case the raster size would exceed the INT_MAX limitation. As a quick and dirty solution we could limit the raster size to INT_MAX in the WMS driver.

I've attached a patch to handle both cases.

Attachments (1)

gdal-tms.patch (15.7 KB) - added by tamas 8 years ago.


Change History (14)

Changed 8 years ago by tamas

Attachment: gdal-tms.patch added

comment:1 Changed 8 years ago by tamas

Milestone: 1.7.0

comment:2 Changed 8 years ago by nowakpl

I really don't like the use of doubles there; either replace them with 'unsigned long long' or uint64_t. Also, silently 'fixing' the size will most likely leave users wondering wtf is wrong; erroring out would be a better option imo.

comment:3 Changed 8 years ago by Even Rouault

The implementation of the block cache is not thread-safe, as it relies on CPLHashSet, which is not thread-safe (it is reentrant, but two different threads cannot operate on the same hash set at the same time). The issue is that the FlushBlock() method can legally be called by a thread that is not the thread handling the dataset.

Scenario:

  • Thread 1: opens a WMS dataset and reads a few blocks.
  • Thread 1: closes the WMS dataset, which calls FlushCache(), which in turn calls CPLHashSetRemoveAll(). At the same time, Thread 2 reads blocks from a GeoTIFF. The block cache gets saturated, so the least recently used blocks from the WMS dataset are discarded and FlushBlock() is called on them, which in turn calls CPLHashSetRemove().

--> Two threads are operating on the same hash set object. Bang!

Similarly, the AdoptBlock() and TryGetLockedBlockRef() methods can be called at the same time as FlushBlock().

So we probably need a per-band mutex to protect all accesses to the hash set.

comment:4 in reply to:  2 Changed 8 years ago by tamas

Replying to nowakpl:

I really don't like the use of doubles there; either replace them with 'unsigned long long' or uint64_t. Also, silently 'fixing' the size will most likely leave users wondering wtf is wrong; erroring out would be a better option imo.

The double type could be replaced with GIntBig (the 64-bit signed integer type), as we probably won't run out of its domain. I don't see a significant difference between these intermediary types (double or GIntBig) for the purpose of this calculation.

I don't think raising an error (as suggested) is a reasonable option here, since it would make it impossible to use tile level 23 in the example above. It's much better when you can address all the tiles except the last ones in the domain. It's much like putting 4 GB of RAM into a Win32 machine: you can only use part of the available memory because of the limited address space. The fact that the user can only use a limited set of the tiles can be handled with a warning in the documentation.

comment:5 Changed 8 years ago by Even Rouault

Actually, I withdraw my suggestion of a per-band mutex to protect all accesses to the hash set. In the scenario I gave, we would have a deadlock between the band mutex and the hRBMutex of gdalrasterblock.cpp:

  • Thread 1 closes the WMS dataset, which calls FlushCache(); that would take the band mutex and call CPLHashSetRemoveAll().
  • At the same time, Thread 2 reads blocks from a GeoTIFF, thus calls GDALRasterBlock::Internalize(), which takes the hRBMutex.
  • Thread 2 (under the hRBMutex) will try to discard the least recently used blocks from the WMS dataset with FlushBlock(), which will try to take the band mutex before calling CPLHashSetRemove() --> it waits, as Thread 1 holds the band mutex.
  • Thread 1 (under the band mutex) in CPLHashSetRemoveAll() will delete blocks, but the destructor of a block tries to take the hRBMutex --> it waits, as Thread 2 holds the hRBMutex --> deadlock.

So it seems that accesses to the hash set can only be protected by the hRBMutex itself.

comment:6 in reply to:  5 Changed 8 years ago by tamas

Even,

I have the impression that this issue should be handled internally in gdalrasterblock.cpp, as it concerns the static variables inside this module. It looks as if poBand->FlushBlock() should be placed inside the guarded section in FlushCacheBlock(). However, I wonder whether this autoflush mechanism currently works in GDAL without explicit user intervention (FlushCacheBlock() is not called from anywhere).

The problem you described seems to be an issue with the current block cache implementation as well, though the critical sections of the current approach may be "smaller" than with the hashset implementation.

One more issue I would mention: the default hash size (53) may be too small for large rasters (it should be increased by default), and frequent rehashing of the hashtable may cause some decrease in overall performance.

It would also be reasonable for these two approaches to work in parallel and be selectable (manually, or automatically depending on the raster size). In this regard the hashset option could be implemented in GDALRasterBand, which would keep the original approach for smaller rasters.

comment:7 Changed 8 years ago by Even Rouault

Tamas,

In the scenario I gave in my previous comment, FlushBlock() is called from FlushCacheBlock(), which is called from Internalize(), which is under the hRBMutex, so putting FlushBlock() in the guarded section of FlushCacheBlock() wouldn't help here.

But yes, there are some issues in the current implementation of gdalrasterband.cpp. See #3225 and #3226. But the race conditions in those tickets seem less likely than the ones in your current proposal.

As for the default hash size, it wouldn't be difficult to add a new constructor to CPLHashSet that selects as the initial size the prime in the anPrimes[] array closest to the passed size.

I agree that the hash set implementation is general purpose (provided we find a thread-safe implementation) and could/should be implemented in GDALRasterBand rather than in the WMS driver. It could be used as a replacement for the "sub-blocking" mechanism, for example.

comment:8 Changed 6 years ago by EliL

Cc: EliL added

comment:9 Changed 3 years ago by vindoctor

I've been manually applying this patch for 2-3 years so I can use WMS. Basically, whenever I detect it's the WMS driver, I allow only one thread to gather data, which removes the threading issue (a hack, yes, but it works). I've been locking outside the SDK, but I guess there's no reason it couldn't lock inside the SDK. Does the group have a reason to avoid adding this patch? In my opinion, being able to use the WMS driver is much better than not being able to use it at all.

comment:10 Changed 3 years ago by vindoctor

comment:11 in reply to:  5 Changed 3 years ago by vindoctor

Replying to rouault:

Actually, I withdraw my suggestion of a per-band mutex to protect all accesses to the hash set. In the scenario I gave, we would have a deadlock between the band mutex and the hRBMutex of gdalrasterblock.cpp:

  • Thread 1 closes the WMS dataset, which calls FlushCache(); that would take the band mutex and call CPLHashSetRemoveAll().
  • At the same time, Thread 2 reads blocks from a GeoTIFF, thus calls GDALRasterBlock::Internalize(), which takes the hRBMutex.
  • Thread 2 (under the hRBMutex) will try to discard the least recently used blocks from the WMS dataset with FlushBlock(), which will try to take the band mutex before calling CPLHashSetRemove() --> it waits, as Thread 1 holds the band mutex.
  • Thread 1 (under the band mutex) in CPLHashSetRemoveAll() will delete blocks, but the destructor of a block tries to take the hRBMutex --> it waits, as Thread 2 holds the hRBMutex --> deadlock.

So it seems that accesses to the hash set can only be protected by the hRBMutex itself.

Do you think we can add this patch to the current trunk and only allow one thread if it's the WMS/tile driver? I've been doing this for a couple of years now by manually applying it to trunk... getting tired of doing it every time we want to update our GDAL build.

comment:12 Changed 3 years ago by Even Rouault

vindoctor,

The gdal-tms.patch attached to this ticket is more of a band-aid than a proper solution. The ultimate solution is to improve the generic block cache mechanism in GDAL core, as was intended by RFC 26. I can't find a public discussion on the RFC. Perhaps Tamas did not find the time/funding to push it further... You may want to contact/contract him to see if he's willing to pursue it.

comment:13 Changed 2 years ago by Even Rouault

Milestone: 1.8.1 → 2.1.0
Resolution: fixed
Status: new → closed

Implemented per RFC 26 in r29425
