Opened 3 years ago

Closed 2 years ago

#3751 closed enhancement (fixed)

Parallel support for aggregates with transitions for raster

Reported by: robe Owned by: pramsey
Priority: high Milestone: PostGIS 2.5.0
Component: postgis Version: 2.3.x
Keywords: Cc: pramsey

Description

I didn't think aggregate functions that use transitions (take internal as input) are parallelizable.

Turns out that even in PostgreSQL 9.6 it is possible to do so as long as you define a serialize and deserialize function for them. These are new args to CREATE AGGREGATE in 9.6

https://www.postgresql.org/docs/9.6/static/sql-createaggregate.html

CREATE AGGREGATE name ( [ argmode ] [ argname ] arg_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , SSPACE = state_data_size ]
    [ , FINALFUNC = ffunc ]
    [ , FINALFUNC_EXTRA ]
    [ , COMBINEFUNC = combinefunc ]
    [ , SERIALFUNC = serialfunc ]
    [ , DESERIALFUNC = deserialfunc ]
    [ , INITCOND = initial_condition ]
    [ , MSFUNC = msfunc ]
    [ , MINVFUNC = minvfunc ]
    [ , MSTYPE = mstate_data_type ]
    [ , MSSPACE = mstate_data_size ]
    [ , MFINALFUNC = mffunc ]
    [ , MFINALFUNC_EXTRA ]
    [ , MINITCOND = minitial_condition ]
    [ , SORTOP = sort_operator ]
    [ , PARALLEL = { SAFE | RESTRICTED | UNSAFE } ]
)

From the docs


serialfunc

An aggregate function whose state_data_type is internal can participate in parallel aggregation only if it has a serialfunc function, which must serialize the aggregate state into a bytea value for transmission to another process. This function must take a single argument of type internal and return type bytea. A corresponding deserialfunc is also required.

deserialfunc

Deserialize a previously serialized aggregate state back into state_data_type. This function must take two arguments of types bytea and internal, and produce a result of type internal. (Note: the second, internal argument is unused, but is required for type safety reasons.)


It would be really nice if we can do this because then we have the possiblity of making really costly aggregates like ST_Union and the various raster aggregates parallelizable.

Change History (7)

comment:1 Changed 3 years ago by komzpa

This ticket reminds me of #3650 - there was a partially committed patch from https://github.com/postgis/postgis/pull/117

parallel=safe there doesn't mean it has to be parallel inside the aggregate - say, a GROUP BY can in theory calculate different bucket's extent in different backend in non-parallel fashion. that option doesn't exist now, since the aggregates are marked parallel unusafe.

comment:2 Changed 3 years ago by robe

Cc: pramsey added

I was hoping this was an already tested bit, but it seems I was mistaken, array_agg and string_agg do not use the new serialfunc, deserialfunc. Most cases I found in postgresql source code are a bit trivial-er. I asked on mailing list why this is so, and seems they might not have gotten to it.

Last comment from Andres Freund:

https://www.postgresql.org/message-id/20170606192609.lgwzs2pcr4ng7et5%40alap3.anarazel.de

> > It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> > aggregate functions using transitions, with the addition of serialfunc and
> > deserialfunc to the aggregate definitions.
> > 
> > https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> > 
> > I was looking at the PostgreSQL 10 source code for some example usages of
> > this and was hoping that array_agg and string_agg would support the feature.
> > At a cursory glance, it seems they do not use this.
> > Examples I see that do support it are the average and standard deviation
> > functions.
> > 
> > Is there a reason for this or it just wasn't gotten to?

I'd suggest trying to write a parallel version of them ;).  Shouldn't be
too hard.

@pramsey have you had a chance to look at this yet?

comment:3 Changed 2 years ago by robe

I've marked all raster functions as parallel safe at r15536. I'll finish off with geometry and then if we get to adding the new logic to REALLY make it allow it to be parallelized that can come later.

comment:4 Changed 2 years ago by robe

In 15537:

Mark all aggregate and window functions as parallel safe, mark more regular functions parallel safe, schema qualify more missing spots
References #3751 for PostGIS 2.4 (trunk)

comment:5 Changed 2 years ago by robe

Summary: Parallel support for aggregates with transitionsParallel support for aggregates with transitions for raster

ST_CountAgg ST_SummaryStatsAgg

comment:6 Changed 2 years ago by robe

Milestone: PostGIS 2.4.0PostGIS 2.5.0

I'm willing to forgo this one. Push back to 2.4 if you can actually get to it.

comment:7 Changed 2 years ago by pramsey

Resolution: fixed
Status: newclosed

I'm closing this ticket in favour of the more action-oriented #3836, which is targeted to 2.5. Your marking of everything as parallel safe has accomplished the most important aspect of the work for this release.

Note: See TracTickets for help on using tickets.