Opened 16 years ago

Closed 15 years ago

Last modified 15 years ago

#2536 closed defect (wontfix)

JPEGDataset::IRasterIO optimizations

Reported by: list67 Owned by: warmerdam
Priority: normal Milestone:
Component: GDAL_Raster Version: unspecified
Severity: normal Keywords: JPEGDataset, IRasterIO
Cc:

Description (last modified by warmerdam)

The following is a change to the JPEGDataset::IRasterIO optimizations. It handles the additional case where nPixelSpace == 3, and it improves the nPixelSpace > 3 case by removing the LoadScanLine(...) call that was made for each x value.

CPLErr JPGDataset::IRasterIO( GDALRWFlag eRWFlag, 
  int nXOff, int nYOff, int nXSize, int nYSize, 
  void *pData, int nBufXSize, int nBufYSize, 
  GDALDataType eBufType, 
  int nBandCount, int *panBandMap, 
  int nPixelSpace, int nLineSpace, int nBandSpace ) 
 
{ 
 if((eRWFlag == GF_Read) && 
  (nBandCount == 3) && 
  (nBands == 3) && 
  (nXOff == 0) && (nXOff == 0) && 
  (nXSize == nBufXSize) && (nXSize == nRasterXSize) && 
  (nYSize == nBufYSize) && (nYSize == nRasterYSize) && 
  (eBufType == GDT_Byte) && (sDInfo.data_precision != 12) && 
  (nPixelSpace >= 3) && 
  (nLineSpace == (nPixelSpace*nXSize)) && 
  (nBandSpace == 1) && 
  (pData != NULL) && 
  (panBandMap != NULL) && 
  (panBandMap[0] == 1) && (panBandMap[1] == 2) && (panBandMap[2] == 3)) 
 { 
  Restart(); 
  int y; 
  CPLErr tmpError; 
  if(nPixelSpace == 3) 
  { 
  // handles copy case 
  for(y = 0; y < nYSize; ++y) 
  { 
  tmpError = LoadScanline(y); 
  if(tmpError != CE_None) return tmpError; 
  memcpy(&(((GByte*)pData)[y*nLineSpace]), (const GByte*)pabyScanline, 3*nBufXSize); 
  } 
  } 
  else 
  { 
  int x; 
  // handles copy with padding case 
  for(y = 0; y < nYSize; ++y) 
  { 
  tmpError = LoadScanline(y); 
  if(tmpError != CE_None) return tmpError; 
  for(x = 0; x < nXSize; ++x) 
  { 
  memcpy(&(((GByte*)pData)[(y*nLineSpace) + (x*nPixelSpace)]), (const GByte*)&(pabyScanline[x*3]), 3); 
  } 
  } 
  } 
  return CE_None; 
 } 
  return GDALPamDataset::IRasterIO(eRWFlag, nXOff, nYOff, nXSize, nYSize, 
  pData, nBufXSize, nBufYSize, eBufType, 
  nBandCount, panBandMap, 
  nPixelSpace, nLineSpace, nBandSpace); 

Attachments (2)

jpegReadTest.cpp (3.0 KB ) - added by list67 16 years ago.
test program
red.jpg (3.3 KB ) - added by list67 16 years ago.

Download all attachments as: .zip

Change History (15)

comment:1 by warmerdam, 16 years ago

Component: defaultGDAL_Raster
Description: modified (diff)
Milestone: 1.6.0
Status: newassigned

comment:2 by warmerdam, 16 years ago

Mike,

My concern with this optimized IRasterIO() implementation is that it defeats normal block caching on jpeg datasets. What is the goal of the optimized implementation? The IReadBlock() implementation already forces green and blue into the block cache if only red is requested, so generally speaking the file will only be read once if the block cache is reasonably large. Is it just to avoid the machinery of pushing data into the block cache, and pulling it back from there in the generic RasterIO() implementation? If that is the concern, it seems that the same could be done for every scanline oriented format - it isn't a special concern for JPEG.

comment:3 by list67, 16 years ago

Frank,

I think that the goal is to read an entire raster into memory as quickly as possible.

It has been a while since the optimization was written. The text from an old email included an estimate that the noticeably slower performance (without the optimization) might be due to each pixel in the image being iterated through for each band (in a GDALCopyWords). I remember stepping through the read process and seeing a lot happening. The optimization's per-scanline memcpy (in the nPixelSpace == 3 case), and per-pixel memcpy (in the nPixelSpace > 3) case, might have resulted in faster reads due to less per-pixel iteration.

The optimization was confined to a very narrow case (reading an entire 3 band, 8 bit, interleaved JPEG). That case was known to be very common for software that I work on. And the case seemed that it might be a very likely common read case for other software, since most of the JPEGs that I've seen are relatively small 3 band, 8 bit, JPEGs.

Since the optimization was written a while ago, it might be good for me to run some tests again to see how the performance compares with and without the optimization.

Thank you.

Mike

by list67, 16 years ago

Attachment: jpegReadTest.cpp added

test program

comment:4 by list67, 16 years ago

Frank,

The attatched program was used to test the optimization with gdal 1.5.2. In some of the runs, the optimization was commented out. The non-optimized runs were about 3 times slower.

Mike

comment:5 by Even Rouault, 16 years ago

Frank,

I think that Mike's optimization could be considered, if made conditionnal by a CSLTestBoolean( CPLGetConfigOption( "GDAL_ONE_BIG_READ", "NO") ) test, as it is done in RawRasterBand::IRasterIO().

So people really knowing what they do could use the direct IO without caching.

comment:6 by list67, 16 years ago

A condition for whether or not to use the optimization seems like a good idea.

I don't know much about CPLGetConfigOption settings. Is that a setting that could affect other parts of an application? If a CPLGetConfigOption condition is added, it seems that using a new variable (instead of GDAL_ONE_BIG_READ) would allow for more fine-grained configuration. For example, if "GDAL_JPEG_IO_OPT_READALL_WOCACHE" were used instead of GDAL_ONE_BIG_READ, then a program could use the JPEG read optimization without affecting (for example) the way that RawRasterBand::IRasterIO(..) works.

comment:7 by warmerdam, 16 years ago

Mike,

I have no objection to your employing this specialized code but I'm still not keen on incorporating this into the standard distribution without understanding why the specialization is so much faster, and if we shouldn't be doing something in the generic code to accelerate it.

comment:8 by Even Rouault, 16 years ago

OK, I've done a bit of analysis.

Here are my findings :

  • the slowness of the non optimized case is mostly due to a block cache size not big enough. As GDALDataset::IRasterIO iterates on each band, when it has finished dealing with the first band, the blocks from the first scanlines that have been cached for band 2 and 3 at the end of JPGRasterBand::IReadBlock have been discarded. So a new jpeg decompression must be done. Solution : when the dataset declares pixel interleaved data and that nXSize == nBufXSize && nYSize == nBufYSize, it's more efficient - and safe - to call BlockBasedRasterIO. Done in r15250
  • the second cause is some slowness in GDALCopyWords. When we cache the data from scanline to block cache (3 byte spacing -> 1 byte spacing), and we retrieve from block cache to user buffer (1 byte spacing -> 3 byte spacing). I've done some optimizations for those cases and high number of words to transfer by doing some loop unrolling. Done in r15249

All in all, here are the results :

$ time GDAL_CACHEMAX=1  JPEG_OPTIMIZATION=NO  ./jpegReadTest st8745.jpg 20
real    0m27.025s
user    0m25.774s
sys     0m1.104s

$ time GDAL_CACHEMAX=1  JPEG_OPTIMIZATION=YES ./jpegReadTest st8745.jpg 20
real    0m24.177s
user    0m22.985s
sys     0m1.060s

where JPEG_OPTIMIZATION controls whether or not Mike's optimized version of JPGDataset::IRasterIO is used or not, and st8745.jpg is a 4000x4000 JPEG dataset.

So the non spezialized version is just about 10% slower than the spezialized one. This difference is due to the block cache mechanism.

I think we could even probably now completely remove JPGDataset::IRasterIO implementation.

comment:9 by Even Rouault, 16 years ago

Mike,

did my commits help for your use case ?

comment:10 by list67, 16 years ago

Thank you for the analysis.

The changes don't seem to improve run times in my use case.

Today, I checked out the code using svn, and built a version of the library with the JPEG optimization from above and another with that optimization commented out (leaving only the GDALPamDataset::IRasterIO call).

I then rebuilt the jpegReadTest.cpp program and ran tests with each build of the library. The following test takes about 10 seconds with the JPEG optimization and about 39 seconds without the JPEG optimization: jpegRead_O2.exe ./red.jpg 1000

I then rebuilt a debug version of GDAL without the JPEG optimization, and stepped through the code to make sure it was getting to the modified rasterio.cpp and gdaldataset.cpp sections during the jpeg reads. The process did enter the modified sections of code in both cases.

I'll attach the .jpg file that I was using for testing.

by list67, 16 years ago

Attachment: red.jpg added

comment:11 by Even Rouault, 16 years ago

With your red.jpg, I've done a few tests (on a optimized build) and I got :

$ time JPEG_OPTIMIZATION=YES ./jpegReadTest red.jpg 200 real 0m7.726s user 0m7.180s sys 0m0.416s

$ time JPEG_OPTIMIZATION=NO ./jpegReadTest red.jpg 200 real 0m9.851s user 0m8.949s sys 0m0.776s

So that's only a 25% difference. I'm surprised that you got a 4x difference. Would that mean that the faster the computer (mine is clearly much slower than yours), the faster the optimized way ? Do you have a way to run a profiler so that we have some elements to understand what you observe ? I use sysprof on Linux for that.

One could also wonder if that benchmark is really significitant. Reading several hundreds times the same file means that the OS will cache it in its internal buffers, thus it eliminates the I/O access time, which is the most significant one in most real-life use cases. For example, when running the jpegRead with one iteration, the runtime from one run to another one varies a lot, and there's no significant difference between the optimized path and the less optimized one (it is around 0.35 s for me) It is always possible to write optimizations and you end up writing everything in assembly code... More seriously, it is difficult for GDAL to guess the best strategy, as the user may need to access again the data after the initial big read. Clearly your test case doesn't need the cache mechanism, but it is not obvious how GDAL could guess that. The environment variable is one, but it is a quite obscure one. All that lets me a bit puzzled

comment:12 by warmerdam, 15 years ago

Resolution: wontfix
Status: assignedclosed

I think for the time being we are not going to do any special case for this.

comment:13 by warmerdam, 15 years ago

Milestone: 1.6.1
Note: See TracTickets for help on using tickets.