Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#2033 closed enhancement (worksforme)

Moving g.pnmcomp to lib/display to improve render performance of wxGUI

Reported by: huhabla
Owned by: grass-dev@…
Priority: major
Milestone: 7.0.0
Component: wxGUI
Version: svn-trunk
Keywords: display, Python, multiprocessing
Cc:
CPU: All
Platform: All

Description

I would like to move the code of g.pnmcomp into the display library, so that we can call it directly as a C function from the wxGUI to avoid file I/O and speed up the rendering process.

I will use the Python multiprocessing module to avoid crashing the GUI in case of a segmentation fault, or in case of an exit call when a fatal error occurs. Hence, to call the C library function via ctypes or PyGRASS, a new process will be spawned. All data will be exchanged between the wxGUI and its sub-processes using Python objects and multiprocessing queues. This can also be used to run several processes in parallel, as g.gui.animation already does.
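To illustrate the pattern (this is not the attached patch), here is a minimal sketch of running the call in a separate process and getting the image back through a queue; the actual ctypes call into the display library is only indicated by a comment, and the entry-point name there is hypothetical:

# Minimal sketch of the proposed isolation pattern: run the C call in a child
# process so that a segfault or exit() cannot take the GUI down.
import multiprocessing as mp

def composite_worker(q, width, height):
    # Here the display-library function would be called via ctypes/PyGRASS,
    # e.g. (hypothetical name): libdisplay.D_pnmcomp(...)
    # For this sketch we just fabricate an RGB buffer of the requested size.
    image = bytearray(width * height * 3)
    q.put(bytes(image))              # only picklable Python objects cross the boundary

def render_composite(width, height, timeout=30):
    q = mp.Queue()
    proc = mp.Process(target=composite_worker, args=(q, width, height))
    proc.start()
    try:
        # Raises the Empty exception from the Queue/queue module after the
        # timeout, e.g. if the child segfaulted, so the GUI process survives.
        return q.get(timeout=timeout)
    finally:
        proc.join()

if __name__ == "__main__":
    data = render_composite(1024, 768)
    print(len(data))                 # 1024 * 768 * 3 bytes, ready for wx.ImageFromBuffer etc.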

I have implemented a prototype that makes use of this concept. The code is attached as a diff against the current grass7 svn-trunk version. It can be seen as a proof of concept that shows how this might work. It also shows that it is still possible to call g.pnmcomp as usual.

This concept may also lead to a new implementation guideline: use more C library functions in the wxGUI to speed up the visualization.

My question is whether this is also possible with d.rast, d.vect and other display modules, i.e. moving the code from these modules into the display library and calling these functions from dedicated wxGUI sub-processes to speed up the rendering.

Attachments (2)

gui_improvement.diff (23.1 KB ) - added by huhabla 11 years ago.
Patch to move the code of g.pnmcomp into the display lib to call it from wxGUI as C-function
display_bench.py (4.6 KB ) - added by huhabla 11 years ago.
Benchmark script


Change History (13)

by huhabla, 11 years ago

Attachment: gui_improvement.diff added

Patch to move the code of g.pnmcomp into the display lib to call it from wxGUI as C-function

in reply to:  description ; comment:1 by glynn, 11 years ago

Replying to huhabla:

My question is whether this is also possible with d.rast, d.vect and other display modules, i.e. moving the code from these modules into the display library and calling these functions from dedicated wxGUI sub-processes to speed up the rendering.

Possible? Probably. Sane? No.

Moving the guts of d.rast/d.vect/etc around won't make it run any faster. If the issue is with the communication of the raster data, there are faster methods than reading and writing PNM files.

Both the PNG and cairo drivers support reading and writing 32-bpp BMP files where the raster data is correctly aligned for memory mapping. Setting GRASS_PNGFILE to a filename with a .bmp suffix selects this format, and setting GRASS_PNG_MAPPED=TRUE causes the drivers to mmap() the file rather than using read() and write().

Once you have d.* commands generating BMP files, it shouldn't be necessary to add any binary blobs to wxGUI. Compositing should be perfectly viable within Python using either numpy, PIL or wxPython (having wxPython perform the compositing during rendering may be able to take advantage of video hardware).
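As an illustration of the pure-Python compositing suggested here (not part of the ticket), a minimal numpy sketch of "over" compositing for RGBA layers; in practice the arrays would be filled from the drivers' BMP/PPM output instead of being generated:

# Minimal numpy sketch of "over" compositing for a stack of RGBA layers.
# In the wxGUI case the arrays would come from the files written by the
# PNG/cairo drivers; here they are synthetic so the example runs as-is.
import numpy as np

def composite_over(layers):
    """Composite RGBA uint8 layers (first = bottom) into one RGB image."""
    h, w = layers[0].shape[:2]
    out = np.zeros((h, w, 3), dtype=np.float32)
    for layer in layers:
        rgb = layer[..., :3].astype(np.float32)
        alpha = layer[..., 3:4].astype(np.float32) / 255.0
        out = rgb * alpha + out * (1.0 - alpha)
    return out.astype(np.uint8)

if __name__ == "__main__":
    base = np.zeros((256, 256, 4), dtype=np.uint8)
    base[..., 3] = 255                            # opaque background layer
    overlay = np.zeros((256, 256, 4), dtype=np.uint8)
    overlay[64:192, 64:192] = (255, 0, 0, 128)    # half-transparent red square
    rgb = composite_over([base, overlay])
    print(rgb.shape, rgb.dtype)                   # (256, 256, 3) uint8 -> wx image, etc.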

Additionally, on X11 (and provided that the cairo library supports it), the cairo driver supports rendering directly into an X pixmap which is retained in the server (typically in video memory) after the d.* program terminates. This has the added advantage that rendering will be performed using the video hardware.

Setting GRASS_PNGFILE to a filename ending in ".xid" selects this option; the XID of the pixmap will be written to that file as a hexadecimal value. The g.cairocomp module can composite these pixmaps without the image data ever leaving video memory ("g.cairocomp -d ..." can be used to delete the pixmaps from the server).
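A small sketch of how the XID route could be driven from Python, using the environment variable names that appear elsewhere in this ticket (the g.cairocomp call itself is left out; see its manual for the options):

# Sketch of the XID route on X11 (assumes a cairo driver built with X support).
import os
from grass.pygrass import modules

os.putenv("GRASS_RENDER_IMMEDIATE", "cairo")
os.putenv("GRASS_PNGFILE", "elevation.xid")   # ".xid" suffix selects pixmap output

modules.Module("d.rast", map="elevation")

# The driver leaves the rendered pixmap in the X server and writes its XID to
# the file as a hexadecimal value; the pixmaps would then be composited
# server-side with g.cairocomp (and removed again with "g.cairocomp -d ...").
with open("elevation.xid") as f:
    xid = int(f.read().strip(), 16)
print(hex(xid))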

The only missing piece of the puzzle is a way to get wxPython to use an existing pixmap (ideally without pulling it into client memory then pushing it back out to the server). The cleanest approach would be via pycairo and wx.lib.wxcairo, which would also allow g.cairocomp to be eliminated, but that's yet another dependency.

in reply to:  1 ; comment:2 by huhabla, 11 years ago

Replying to glynn:

Replying to huhabla:

My question is whether this is also possible with d.rast, d.vect and other display modules, i.e. moving the code from these modules into the display library and calling these functions from dedicated wxGUI sub-processes to speed up the rendering.

Possible? Probably. Sane? No.

Moving the guts of d.rast/d.vect/etc around won't make it run any faster. If the issue is with the communication of the raster data, there are faster methods than reading and writing PNM files.

I hope to speed up the composition by avoiding disc I/O.

Both the PNG and cairo drivers support reading and writing 32-bpp BMP files where the raster data is correctly aligned for memory mapping. Setting GRASS_PNGFILE to a filename with a .bmp suffix selects this format, and setting GRASS_PNG_MAPPED=TRUE causes the drivers to mmap() the file rather than using read() and write().

As far as I understand mmap(), it is file-backed and reads/writes the data from the file on demand into shared memory? An exception is anonymous mapping, but is that also supported on Windows? How can we access an anonymous mmap() from wxPython?

Once you have d.* commands generating BMP files, it shouldn't be necessary to add any binary blobs to wxGUI. Compositing should be perfectly viable within Python using either numpy, PIL or wxPython (having wxPython perform the compositing during rendering may be able to take advantage of video hardware).

What do you mean by binary blobs? Binary large objects? Well, as far as I can see from the wx documentation, there is no way around blobs, since even numpy arrays must be converted into a bytearray or similar to create a wx image. Does wxPython take advantage of the video hardware? IMHO we can also implement an OpenCL version of the PNM image composition. In that case it would be a large advantage to also have the images created by d.rast and d.vect in a shared memory area, to avoid disk I/O.

Additionally, on X11 (and provided that the cairo library supports it), the cairo driver supports rendering directly into an X pixmap which is retained in the server (typically in video memory) after the d.* program terminates. This has the added advantage that rendering will be performed using the video hardware.

Setting GRASS_PNGFILE to a filename ending in ".xid" selects this option; the XID of the pixmap will be written to that file as a hexadecimal value. The g.cairocomp module can composite these pixmaps without the image data ever leaving video memory ("g.cairocomp -d ..." can be used to delete the pixmaps from the server).

The only missing piece of the puzzle is a way to get wxPython to use an existing pixmap (ideally without pulling it into client memory then pushing it back out to the server). The cleanest approach would be via pycairo and wx.lib.wxcairo, which would also allow g.cairocomp to be eliminated, but that's yet another dependency.

It still puzzles me how to create a shared memory buffer using multiprocessing.sharedctypes.Array and use it in the C-function calls. In the current approach I have to use a queue object to transfer the image data from the child process to its parent, and therefore to transform the image buffer into a Python bytearray. How to access video memory is another open point. Are pipes or similar techniques available for this kind of operation? Should we wait for hardware that has no distinction between video and main memory? Using pycairo.BitmapFromImageSurface() seems to be a good approach?
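Not from the attached patch, but a minimal sketch of how the multiprocessing.sharedctypes route could look; the child writes into the shared buffer in place (where the ctypes call would write), so no bytearray has to be copied through a queue:

# Minimal sketch: share one image buffer between the wxGUI process and a
# render child via multiprocessing.sharedctypes instead of copying the image
# through a queue. The ctypes call into a display-library function is only
# indicated by a comment (hypothetical).
import ctypes
import multiprocessing as mp
from multiprocessing import sharedctypes
import numpy as np

WIDTH, HEIGHT = 1024, 768

def worker(shared, width, height):
    # A C function called via ctypes could write straight into this buffer,
    # e.g. (hypothetical): libdisplay.render_to_buffer(shared, width, height)
    # Here we just poke one byte to show that the parent sees the change.
    shared[0] = 255

if __name__ == "__main__":
    # RawArray = plain shared memory without a lock; parent and child see the same bytes.
    shared = sharedctypes.RawArray(ctypes.c_ubyte, WIDTH * HEIGHT * 3)
    p = mp.Process(target=worker, args=(shared, WIDTH, HEIGHT))
    p.start()
    p.join()
    view = np.frombuffer(shared, dtype=np.uint8)   # zero-copy numpy view for compositing
    print(view[0])                                 # 255, written by the child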

However, we should focus on approaches that work on Linux/Unix/Mac and Windows. Using X11-specific features is not meaningful in my humble opinion. Besides that, the cairo driver does not yet work with the Windows GRASS installer (missing dependencies, with the exception of Markus Metz's local installation).

I don't think that calling the d.vect and d.rast functionality as library functions is insane. :) Using library functions would allow using the same image buffers across rendering and composition, which could then be passed to the wxGUI parent process using the multiprocessing queue. This will not increase the actual rendering speed, but it will avoid several I/O operations and allow I/O-independent parallel rendering in the case of multi-map visualization. The mmap() approach is not needed in this case either.

Well, the large number of d.vect and d.rast options will make it difficult to design a convenient C-function interface ... but this can be solved.

In the long term, the current command interface to access the wx monitors is a bit ... let's say ... error prone. It would be an advantage to have the d.* modules as Python modules that are able to talk to the monitors using socket connections or other cross-platform IPC methods, sending serialized objects that describe the call of the new (d.vect) vector rendering or (d.rast) raster rendering functions in the display library. In addition, these modules could call the display library functions themselves for image rendering without monitors.
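Purely as an illustration of that last idea (the address, authkey and request structure are made up), a minimal sketch using multiprocessing.connection, which wraps sockets/named pipes in a cross-platform way:

# Minimal sketch of the "serialized render request" idea: a d.* Python wrapper
# sends a plain dict describing the render call to a monitor process instead of
# writing files.
import threading
from multiprocessing.connection import Listener, Client

ADDRESS = ("localhost", 6000)
AUTHKEY = b"wx-monitor"

def d_rast_frontend():
    # What a Python d.rast wrapper could send to the wx monitor.
    conn = Client(ADDRESS, authkey=AUTHKEY)
    conn.send({"module": "d.rast", "options": {"map": "elevation"}})
    print(conn.recv())                     # e.g. {'status': 'ok'}
    conn.close()

if __name__ == "__main__":
    listener = Listener(ADDRESS, authkey=AUTHKEY)   # bind before the client connects
    client = threading.Thread(target=d_rast_frontend)
    client.start()
    conn = listener.accept()
    request = conn.recv()                  # the serialized render request
    print("render", request["module"], request["options"])
    conn.send({"status": "ok"})            # here the display library would be called
    conn.close()
    listener.close()
    client.join()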

in reply to:  2 comment:3 by glynn, 11 years ago

Replying to huhabla:

I hope to speed up the composition by avoiding disc I/O.

If one process writes a file and another immediately reads it, it doesn't necessarily involve "disc" I/O.

The OS caches disc blocks in RAM. write() completes as soon as the data has been copied to the cache (the kernel will copy it to disc on its own schedule), read() reads the data from the cache (and only requires disc access for data which isn't already in the cache).

The kernel will use all "free" memory for the disc cache. So unless memory pressure is high, the files written by the display driver will remain in the cache for quite a while.

As far as I understand mmap(), it is file-backed and reads/writes the data from the file on demand into shared memory? An exception is anonymous mapping, but is that also supported on Windows? How can we access an anonymous mmap() from wxPython?

Anonymous mmap() isn't relevant here. mmap() is file backed, but this doesn't affect the time required to read and write the file unless memory pressure is so high that the size of the file exceeds the amount of free memory. In the event that sufficient free memory is available, neither writing nor reading will block waiting for disc I/O.
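As a side note (not from the ticket), reading such a mmap()ed 32-bpp BMP from Python could look roughly like this; the pixel-data offset is taken from the standard bfOffBits field of the BMP header rather than assuming a driver-specific header size:

# Sketch: map a driver-written .bmp into memory and view the pixel data with
# numpy, without an explicit read() of the pixel data. Row order and channel
# order depend on the BMP header and are ignored here.
import mmap
import struct
import numpy as np

def open_bmp_pixels(path, width, height):
    f = open(path, "rb")
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # bfOffBits (bytes 10-13 of the BMP file header) = offset of the pixel data
    offset = struct.unpack_from("<I", mapped, 10)[0]
    # 32 bpp as written by the PNG/cairo drivers: 4 bytes per pixel
    pixels = np.frombuffer(mapped, dtype=np.uint8,
                           count=width * height * 4, offset=offset)
    return pixels.reshape(height, width, 4), mapped, f

# usage: width/height must match GRASS_WIDTH/GRASS_HEIGHT used for rendering
# img, mapped, f = open_bmp_pixels("rast.bmp", 1024, 1024)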

Once you have d.* commands generating BMP files, it shouldn't be necessary to add any binary blobs to wxGUI. Compositing should be perfectly viable within Python using either numpy, PIL or wxPython (having wxPython perform the compositing during rendering may be able to take advantage of video hardware).

What do you mean by binary blobs? Binary large objects?

Machine code.

IOW, it shouldn't be necessary to move g.pnmcomp into a library (DLL/DSO) which is accessed from the wxGUI process. The replacement can just be written in Python, using existing Python modules (numpy, PIL or wxPython) to get reasonable performance.

Does wxPython take advantage of the video hardware?

wxWidgets is a cross-platform wrapper around existing toolkits: Windows GDI, GTK/GDK, etc. The underlying toolkit will use the video hardware, but wxWidgets may insist upon inserting itself between the data and the hardware.

IMHO we can also implement an OpenCL version of the PNM image composition.

This won't help much unless you can persuade wxWidgets/wxPython to use the composed image directly. If it insists upon pulling the data from video memory so that it can pass it to a function which just pushes it straight back again, it would probably be quicker to perform the compositing on the CPU.

It still puzzles me how to create a shared memory buffer using multiprocessing.sharedctypes.Array and use it in the C-function calls.

I'm not sufficiently familiar with the multiprocessing module to answer this question. However, if it turns out to be desirable (and I don't actually think it will), it wouldn't be that hard to modify the PNG/cairo drivers to write into a SysV-IPC shared memory segment (shmat() etc).

But I don't think that will offer any advantages over mmap()d files, and it's certainly at a disadvantage compared to GPU rendering into shared video memory.

Should we wait for hardware that has no distinction between video and main memory?

X11 will always make a distinction between server memory and client memory, as those may be on different physical systems.

Using pycairo.BitmapFromImageSurface() seems to be a good approach?

It may be the best that you're going to get. GDK can create a GdkPixmap from an XID (gdk_pixmap_foreign_new), and this functionality is exposed by PyGTK. But the higher-level libraries all seem to insist upon creating the pixmap themselves from data which is in client memory. Or at least, if the functionality is available, it doesn't seem to be documented.
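For reference, the client-memory variant of the pycairo + wx.lib.wxcairo route (going through a cairo image surface loaded from the driver's output, not through an X pixmap) could look roughly like this:

# Sketch: hand a cairo surface to wxPython via wx.lib.wxcairo.
# This variant goes through client memory (an image surface loaded from the
# driver's PNG output); using a server-side X pixmap directly is exactly the
# part that is still missing.
import wx
import wx.lib.wxcairo
import cairo

class Canvas(wx.Frame):
    def __init__(self, pngfile):
        wx.Frame.__init__(self, None, title=pngfile)
        surface = cairo.ImageSurface.create_from_png(pngfile)
        self.bitmap = wx.lib.wxcairo.BitmapFromImageSurface(surface)
        self.Bind(wx.EVT_PAINT, self.on_paint)

    def on_paint(self, event):
        dc = wx.PaintDC(self)
        dc.DrawBitmap(self.bitmap, 0, 0)

if __name__ == "__main__":
    app = wx.App()
    Canvas("rast.png").Show()
    app.MainLoop()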

I don't think that calling the d.vect and d.rast functionality as library functions is insane. :)

Eliminating the process boundary for no reason other than to avoid having to figure out inter-process communication is not sane.

Using library functions would allow using the same image buffers across rendering and composition, which could then be passed to the wxGUI parent process using the multiprocessing queue.

Using files allows using the same "image buffers" (i.e. the kernel's disc cache).

Well, the large number of d.vect and d.rast options will make it difficult to design a convenient C-function interface ... but this can be solved.

Solving inter-process communication is likely to be a lot simpler, and the end result will be nicer.

comment:4 by huhabla, 11 years ago

I am not fully convinced by the mmap() approach for IPC. It is not guaranteed that mmap() is faster than the usual file I/O. [1]

I have written a small Python script to test the performance of d.rast and d.vect using the North Carolina sample dataset. The script is attached. I am using the PNG and Cairo drivers, switching mmap on and off for different window sizes. In addition, I call the d.rast (1x) and d.vect (2x) modules in parallel to measure the render speed gain. The composition is still missing, but I will add the PIL approach available in the GRASS wiki with python.mmap support. Here is the script:

# -*- coding: utf-8 -*-
from grass.pygrass import modules
import os
import time

parallel = [True, False]
drivers = ["png", "cairo"]
mmap_modes = ["FALSE", "TRUE"]
sizes = [1024, 4096]

def render_image(module, driver="png", pngfile="test.png",
                 size=4096, mapped="TRUE"):
    os.putenv("GRASS_RENDER_IMMEDIATE", "%s"%driver)
    os.putenv("GRASS_PNGFILE", "%s"%pngfile)
    os.putenv("GRASS_WIDTH", "%i"%size)
    os.putenv("GRASS_HEIGHT", "%i"%size)
    os.putenv("GRASS_PNG_MAPPED", "%s"%mapped)
    module.run()

def composite_images(files):
    pass

def main():
    # Set the region
    modules.Module("g.region", rast="elevation", flags="p")

    for finish in parallel:
        if finish:
            print("*** Serial runs")
        else:
            print("*** Parallel runs")

        # Setup the modules
        rast   = modules.Module("d.rast", map="elevation", run_=False,
                                quiet=True, finish_=False)
        vectB = modules.Module("d.vect", map="streams", width=1, color="blue",
                                fcolor="aqua", type=["area","line"],
                                run_=False, quiet=True, finish_=finish)
        vectA = modules.Module("d.vect", map="roadsmajor", width=2,
                               run_=False, quiet=True, finish_=finish)

        count = 0
        for driver in drivers:
            for mode in mmap_modes:
                for size in sizes:
                    start = time.time()
                    count += 1
                    files = []
                    if mode == "TRUE":
                        rast_file = "rast.bmp"
                        vectA_file="vectA.bmp"
                        vectB_file="vectB.bmp"
                    else:
                        rast_file = "rast.png"
                        vectA_file="vectA.png"
                        vectB_file="vectB.png"

                    render_image(rast, driver=driver, pngfile=rast_file,
                                 size=size, mapped=mode)
                    render_image(vectA, driver=driver, pngfile=vectA_file,
                                 size=size, mapped=mode)
                    render_image(vectB, driver=driver, pngfile=vectB_file,
                                 size=size, mapped=mode)

                    files.append(rast_file)
                    files.append(vectA_file)
                    files.append(vectB_file)

                    # Wait for processes
                    rast.popen.wait()
                    vectA.popen.wait()
                    vectB.popen.wait()

                    # Composite the images
                    composite_images(files)

                    for file in files:
                        os.remove(file)

                    elapsed = (time.time() - start)
                    print("*** Run %i Driver=%s mmap=%s Size=%i time=%f"%(count,
                                                                          driver,
                                                                          mode,
                                                                          size,
                                                                          elapsed))

main()

The result of the benchmark:

GRASS 7.0.svn (nc_spm_08_grass7):~/src > python display_bench.py 
projection: 99 (Lambert Conformal Conic)
zone:       0
datum:      nad83
ellipsoid:  a=6378137 es=0.006694380022900787
north:      228500
south:      215000
west:       630000
east:       645000
nsres:      10
ewres:      10
rows:       1350
cols:       1500
cells:      2025000
*** Serial runs
*** Run 1 Driver=png mmap=FALSE Size=1024 time=0.796055
*** Run 2 Driver=png mmap=FALSE Size=4096 time=3.389201
*** Run 3 Driver=png mmap=TRUE Size=1024 time=0.449877
*** Run 4 Driver=png mmap=TRUE Size=4096 time=3.723065
*** Run 5 Driver=cairo mmap=FALSE Size=1024 time=0.824797
*** Run 6 Driver=cairo mmap=FALSE Size=4096 time=2.632125
*** Run 7 Driver=cairo mmap=TRUE Size=1024 time=0.542321
*** Run 8 Driver=cairo mmap=TRUE Size=4096 time=2.276822
*** Parallel runs
*** Run 1 Driver=png mmap=FALSE Size=1024 time=0.756147
*** Run 2 Driver=png mmap=FALSE Size=4096 time=3.113990
*** Run 3 Driver=png mmap=TRUE Size=1024 time=0.530959
*** Run 4 Driver=png mmap=TRUE Size=4096 time=3.355732
*** Run 5 Driver=cairo mmap=FALSE Size=1024 time=0.865963
*** Run 6 Driver=cairo mmap=FALSE Size=4096 time=2.358270
*** Run 7 Driver=cairo mmap=TRUE Size=1024 time=0.566976
*** Run 8 Driver=cairo mmap=TRUE Size=4096 time=1.934245

There is no mmap() speed improvement for the PNG driver for large window sizes? I will investigate this further.

[1] http://lists.freebsd.org/pipermail/freebsd-questions/2004-June/050371.html

in reply to:  4 comment:5 by glynn, 11 years ago

Replying to huhabla:

Your tests are comparing raw BMP with PNG (which uses zlib compression). If you aren't seeing a significant difference between those two, then the I/O overhead is negligible and performance is dictated by rendering speed.

Regardless of whether I/O uses mmap() or write() and read(), disk transfer doesn't get involved unless memory pressure is so high that the data gets discarded from the cache before it is read. And if memory pressure is that high, disk transfer will get involved anyhow when the "memory" buffers are swapped out (and if memory pressure is high and you don't have swap, then you'll just get an out-of-memory failure).

comment:6 by huhabla, 11 years ago

I have improved the benchmark script and implemented PIL-based and g.pnmcomp image composition (without transparency). Now png, bmp and ppm images are created, and mmap is enabled for the bmp images. Time is measured for the whole rendering/composition process and separately for the composition. My test system is a Core i5-2410M with 8 GB RAM and a 320 GB hard disk running Ubuntu 12.04 LTS.

It seems to me that creating raw bmp images without mmap enabled shows the best performance for the PNG and Cairo drivers. Maybe I did something wrong, but the use of mmap shows no obvious benefit? The png compression slows the rendering down significantly and is IMHO not well suited as an image exchange format in the rendering/composition process.

Running the render processes in parallel shows a significant benefit only for the 4096x4096 pixel images.

Any suggestions to improve the benchmark? Does my setup produce reasonable results?

Here is the script:

# -*- coding: utf-8 -*-
import os
import time
import Image
import wx
from grass.pygrass import modules

parallel = [True, False]
drivers = ["png", "cairo"]
bitmaps = ["png", "bmp", "ppm"]
mmap_modes = ["FALSE", "TRUE"]
sizes = [1024, 4096]

############################################################################

def render_image(module, driver="png", pngfile="test.png",
                 size=4096, mapped="TRUE"):
    os.putenv("GRASS_RENDER_IMMEDIATE", "%s"%driver)
    os.putenv("GRASS_PNGFILE", "%s"%pngfile)
    os.putenv("GRASS_WIDTH", "%i"%size)
    os.putenv("GRASS_HEIGHT", "%i"%size)
    os.putenv("GRASS_PNG_MAPPED", "%s"%mapped)
    module.run()

############################################################################

def composite_images(files, bitmap, mode, size):
    start = time.time()
    if bitmap == "ppm":
        filename = "output"
        filename += ".ppm"
        modules.Module("g.pnmcomp", input=files, width=size, height=size,
                       output=filename)
        # Load the image as wx image for visualization
        img = wx.Image(filename, wx.BITMAP_TYPE_ANY)
        os.remove(filename)
    else:
        images = []
        size = None
        for m in files:
            im = Image.open(m)
            images.append(im)
            size = im.size
        comp = Image.new('RGB', size)
        for im in images:
            comp.paste(im)
        wxImage = wx.EmptyImage(*comp.size)
        wxImage.SetData(comp.convert('RGB').tostring())

    return (time.time() - start)

############################################################################

def main():
    # Set the region
    modules.Module("g.region", rast="elevation", flags="p")

    for finish in parallel:
        if finish:
            print("*** Serial runs")
        else:
            print("*** Parallel runs")

        print("Run\tSize\tDriver\tBitmap\tmmap\trender\tcomposite")

        # Setup the modules
        rast   = modules.Module("d.rast", map="elevation", run_=False,
                                quiet=True, finish_=False)
        vectB = modules.Module("d.vect", map="streams", width=1, color="blue",
                                fcolor="aqua", type=["area","line"],
                                run_=False, quiet=True, finish_=finish)
        vectA = modules.Module("d.vect", map="roadsmajor", width=2,
                               run_=False, quiet=True, finish_=finish)

        count = 0
        for size in sizes:
            for driver in drivers:
                for bitmap in bitmaps:
                    for mode in mmap_modes:
                        # Skip mmap for non-bmp files
                        if mode == "TRUE" and bitmap != "bmp":
                            continue

                        start = time.time()
                        count += 1
                        files = []

                        rast_file = "rast.%s"%(bitmap)
                        vectA_file="vectA.%s"%(bitmap)
                        vectB_file="vectB.%s"%(bitmap)

                        files.append(rast_file)
                        files.append(vectA_file)
                        files.append(vectB_file)

                        render_image(rast, driver=driver,
                                     pngfile=rast_file,
                                     size=size, mapped=mode)

                        render_image(vectA, driver=driver,
                                     pngfile=vectA_file,
                                     size=size, mapped=mode)

                        render_image(vectB, driver=driver,
                                     pngfile=vectB_file,
                                     size=size, mapped=mode)

                        # Wait for processes
                        rast.popen.wait()
                        vectA.popen.wait()
                        vectB.popen.wait()

                        # Composite the images
                        comptime = composite_images(files, bitmap, mode,
                                                    size)

                        for file in files:
                            os.remove(file)

                        elapsed = (time.time() - start)
                        print("%i\t%i\t%s\t%s\t%s\t%.2f\t%.2f"%(count, size,
                                                                driver, bitmap,
                                                                mode, elapsed,
                                                                comptime))

############################################################################

main()

Here are the benchmark results:

GRASS 7.0.svn (nc_spm_08_grass7):~/src > python display_bench.py 
projection: 99 (Lambert Conformal Conic)
zone:       0
datum:      nad83
ellipsoid:  a=6378137 es=0.006694380022900787
north:      228500
south:      215000
west:       630000
east:       645000
nsres:      10
ewres:      10
rows:       1350
cols:       1500
cells:      2025000
*** Serial runs
Run	Size	Driver	Bitmap	mmap	render	composite
1	1024	png	png	FALSE	0.87	0.11
2	1024	png	bmp	FALSE	0.45	0.03
3	1024	png	bmp	TRUE	0.48	0.03
4	1024	png	ppm	FALSE	0.47	0.07
5	1024	cairo	png	FALSE	0.93	0.09
6	1024	cairo	bmp	FALSE	0.52	0.03
7	1024	cairo	bmp	TRUE	0.56	0.03
8	1024	cairo	ppm	FALSE	0.61	0.06
9	4096	png	png	FALSE	4.74	1.29
10	4096	png	bmp	FALSE	3.43	0.38
11	4096	png	bmp	TRUE	4.15	0.38
12	4096	png	ppm	FALSE	3.04	0.55
13	4096	cairo	png	FALSE	3.68	0.99
14	4096	cairo	bmp	FALSE	1.95	0.37
15	4096	cairo	bmp	TRUE	2.65	0.37
16	4096	cairo	ppm	FALSE	3.44	0.55
*** Parallel runs
Run	Size	Driver	Bitmap	mmap	render	composite
1	1024	png	png	FALSE	0.92	0.11
2	1024	png	bmp	FALSE	0.50	0.03
3	1024	png	bmp	TRUE	0.48	0.03
4	1024	png	ppm	FALSE	0.51	0.07
5	1024	cairo	png	FALSE	0.98	0.08
6	1024	cairo	bmp	FALSE	0.53	0.03
7	1024	cairo	bmp	TRUE	0.60	0.03
8	1024	cairo	ppm	FALSE	0.67	0.07
9	4096	png	png	FALSE	4.77	1.33
10	4096	png	bmp	FALSE	3.08	0.37
11	4096	png	bmp	TRUE	3.74	0.38
12	4096	png	ppm	FALSE	2.84	0.55
13	4096	cairo	png	FALSE	3.38	1.01
14	4096	cairo	bmp	FALSE	1.82	0.37
15	4096	cairo	bmp	TRUE	2.44	0.37
16	4096	cairo	ppm	FALSE	2.93	0.55

in reply to:  6 comment:7 by glynn, 11 years ago

Replying to huhabla:

It seems to me that creating raw bmp images without mmap enabled shows the best performance for the PNG and Cairo drivers. Maybe I did something wrong, but the use of mmap shows no obvious benefit?

Using mmap() in the driver is probably not that significant in this context.

It's more useful when GRASS_PNG_READ=TRUE, the resolution is high, and the rendering is simple and/or limited to a portion of the image. In that situation, mmap() eliminates the read() as well as the write(), and only the modified portion needs to be read and written.

Another area where it matters is with e.g. wximgview.py (and its predecessors), as it's safe to read a BMP image which is being modified using mmap(), whereas doing the same thing to a file which is being written out with write() runs the risk of reading a truncated file.

Other than that, the performance difference between using mmap() and read() on the read side boils down to mmap() avoiding a memcpy(). The extent to which that matters depends upon what else you're doing with the data. For wxGUI, it's probably a drop in the ocean.

Any suggestions to improve the benchmark? Does my setup produce reasonable results?

There isn't anything I'd particularly take issue with. However:

  1. With the cairo driver, BMP files use pre-multiplied alpha (because that's what cairo uses internally), whereas PPM/PGM output includes an un-multiplication step. So depending upon your perspective, the cairodriver benchmarks are rigged against PPM or in favour of BMP (see the sketch after this list).
  2. Producing separate results for PPM with g.pnmcomp and PPM with PIL would provide a clearer comparison between the two compositing options and the various formats.
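As an illustration of point 1 (not from the ticket), un-premultiplying cairo-style pixels with numpy, which is what a Python compositor would have to do to treat the cairo driver's BMP output the same way as its PPM/PGM output, would look roughly like this:

# Sketch: un-premultiply cairo-style BGRA pixels (premultiplied alpha).
import numpy as np

def unpremultiply(bgra):
    """bgra: uint8 array of shape (h, w, 4) with premultiplied alpha."""
    out = bgra.astype(np.float32)
    alpha = out[..., 3:4]
    nonzero = alpha > 0
    # Divide the colour channels by alpha where alpha > 0, leave them 0 otherwise.
    out[..., :3] = np.where(nonzero, out[..., :3] * 255.0 / np.maximum(alpha, 1), 0)
    return np.clip(out, 0, 255).astype(np.uint8)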

comment:8 by huhabla, 11 years ago

I have updated the display benchmark script to compare the PPM performance of PIL and g.pnmcomp. System: Ubuntu 12.04 LTS, AMD Phenom(tm) II X6 1090T processor, 16 GB RAM, 1 TB hard disk. Please make sure that you have the latest grass7 svn version to reproduce the benchmark results, since there was a bug in the pygrass Module run() function that did not allow parallel process runs.

GRASS 7.0.svn (nc_spm_08_grass7):~/Downloads > python display_bench.py 
projection: 99 (Lambert Conformal Conic)
zone:       0
datum:      nad83
ellipsoid:  a=6378137 es=0.006694380022900787
north:      228500
south:      215000
west:       630000
east:       645000
nsres:      10
ewres:      10
rows:       1350
cols:       1500
cells:      2025000
*** Serial runs
Run	Size	Driver	Bitmap	mmap	render	composite
1	1024	png	png	FALSE	0.859	0.135  PIL
2	1024	png	bmp	FALSE	0.447	0.044  PIL
3	1024	png	bmp	TRUE	0.446	0.044  PIL
4	1024	png	ppm	FALSE	0.430	0.046  PIL
5	1024	png	ppm	FALSE	0.461	0.066  g.pnmcomp
6	1024	cairo	png	FALSE	0.900	0.102  PIL
7	1024	cairo	bmp	FALSE	0.535	0.055  PIL
8	1024	cairo	bmp	TRUE	0.527	0.045  PIL
9	1024	cairo	ppm	FALSE	0.579	0.050  PIL
10	1024	cairo	ppm	FALSE	0.579	0.051  g.pnmcomp
11	4096	png	png	FALSE	5.106	1.513  PIL
12	4096	png	bmp	FALSE	2.728	0.602  PIL
13	4096	png	bmp	TRUE	2.724	0.596  PIL
14	4096	png	ppm	FALSE	2.402	0.604  PIL
15	4096	png	ppm	FALSE	2.129	0.306  g.pnmcomp
16	4096	cairo	png	FALSE	4.011	1.236  PIL
17	4096	cairo	bmp	FALSE	1.273	0.633  PIL
18	4096	cairo	bmp	TRUE	1.281	0.599  PIL
19	4096	cairo	ppm	FALSE	2.510	0.606  PIL
20	4096	cairo	ppm	FALSE	2.230	0.311  g.pnmcomp
*** Parallel runs
Run	Size	Driver	Bitmap	mmap	render	composite
1	1024	png	png	FALSE	0.856	0.127  PIL
2	1024	png	bmp	FALSE	0.456	0.052  PIL
3	1024	png	bmp	TRUE	0.457	0.044  PIL
4	1024	png	ppm	FALSE	0.442	0.048  PIL
5	1024	png	ppm	FALSE	0.447	0.059  g.pnmcomp
6	1024	cairo	png	FALSE	0.902	0.100  PIL
7	1024	cairo	bmp	FALSE	0.535	0.049  PIL
8	1024	cairo	bmp	TRUE	0.528	0.042  PIL
9	1024	cairo	ppm	FALSE	0.586	0.046  PIL
10	1024	cairo	ppm	FALSE	0.595	0.063  g.pnmcomp
11	4096	png	png	FALSE	4.481	1.535  PIL
12	4096	png	bmp	FALSE	2.331	0.608  PIL
13	4096	png	bmp	TRUE	2.344	0.595  PIL
14	4096	png	ppm	FALSE	2.139	0.603  PIL
15	4096	png	ppm	FALSE	1.808	0.294  g.pnmcomp
16	4096	cairo	png	FALSE	3.374	1.226  PIL
17	4096	cairo	bmp	FALSE	1.269	0.619  PIL
18	4096	cairo	bmp	TRUE	1.283	0.586  PIL
19	4096	cairo	ppm	FALSE	2.117	0.598  PIL
20	4096	cairo	ppm	FALSE	1.790	0.486  g.pnmcomp

by huhabla, 11 years ago

Attachment: display_bench.py added

Benchmark script

in reply to:  8 comment:9 by glynn, 11 years ago

Replying to huhabla:

What I take away from this:

  • PNG has noticeable overhead even for reading, and substantial overhead for writing.
  • BMP versus PPM makes no difference in terms of I/O.
  • When using the cairo driver, un-multiplying the alpha for PPM has a noticeable overhead. As the script doesn't handle cairodriver's BMP files correctly, the figures for cairo/BMP aren't meaningful.
  • g.pnmcomp has higher throughput but also a higher constant overhead, so it's faster than PIL for larger images and slower for smaller images. And the PIL version ignores the PGM file containing the alpha channel.
  • The amount of noise in the timings is noticeable but not all that significant. In theory, the composite timings for a given size and format shouldn't depend upon pngdriver versus cairodriver or mmap'd BMP versus non-mmap'd BMP (although the difference between pngdriver and cairodriver PNGs may indicate differences in options, e.g. compression).

comment:10 by huhabla, 11 years ago

Resolution: worksforme
Status: new → closed

My conclusion:

  • Moving the code of g.pnmcomp, d.vect, d.rast ... d.* into the display library for speedup reasons is not meaningful. It might become meaningful if we decide to implement the d.* modules as Python modules that communicate with the wx display using sockets instead of files to call the rendering backend. Well, the same can be achieved by implementing Python wrapper modules around the display modules ... so it might not be meaningful at all.
  • The current wxGUI rendering approach using PPM and g.pnmcomp seems to be the most efficient, considering the fact that the cairo driver is not yet available in the Windows version of grass7. It seems to me that using PIL will not provide a large speedup benefit over g.pnmcomp, especially for large images.
  • A small speedup can be achieved when calling the d.* modules in parallel in the GUI, especially when several maps need to be re-rendered.
  • IMHO the only way to speedup the rendering is to make d.rast and d.vect faster.

in reply to:  10 comment:11 by martinl, 11 years ago

Replying to huhabla:

  • The current wxGUI rendering approach using PPM and g.pnmcomp seems to be the most efficient, considering the fact that the cairo driver is not yet available in the Windows version of grass7. It seems to me that using PIL will not provide a large speedup benefit over g.pnmcomp, especially for large images.

Small update: since r57542, GRASS 7 is built with cairo support on Windows as well.

Martin
