Opened 11 years ago

Closed 6 years ago

#576 closed defect (fixed)

i.pca fails to center data prior to analysis

Reported by: nikos Owned by: grass-dev@…
Priority: normal Milestone: 6.4.4
Component: Raster Version: svn-develbranch6
Keywords: i.pca, data centering, prcomp(), R, eigenvectors Cc: nikos.alexandris@…
CPU: Unspecified Platform: Unspecified

Description

I have spotted one case where i.pca does not work as expected. I have a set of 3 MODIS surface reflectance bands. Performing PCA on those using i.pca does not center the data before the analysis, that is, the mean of each dimension (band) is not subtracted from the dimension itself to give a dataset that has zero mean which is an integral part of the solution to PCA.

  • i.pca on the _raw_ bands gives the following Eigenvalues + Eigenvectors:
    PC1 6307563.04 (-0.6353,-0.6485,-0.4192) [98.71%]
    PC2   78023.63 (-0.7124, 0.2828, 0.6422) [ 1.22%]
    PC3    4504.60 (-0.2979, 0.7067,-0.6417) [ 0.07%]
    


  • Using the same data with the prcomp(x, center=TRUE, scale=FALSE) function in R, which centers the dataset by default anyway if not told otherwise, gives different results:
               PC1         PC2        PC3
mod07_b2 0.4372107  0.83099407 -0.3439413
mod07_b6 0.7210155 -0.09527873  0.6863371
mod07_b7 0.5375718 -0.54806096 -0.6408165

Note: the output of prcomp() delivers the Principal Components column-wise, while i.pca delivers them row-wise.

  • Further checking reveals that centering the data manually in grass, e.g. using
    r.mapcalc "mod_band_centered = mod_band - mean(mod_band)"
    

gives (almost) the same results as the prcomp() function with the parameter center=TRUE (example above). The numbers talk for themself:

PC1 270343.07 (-0.4403,-0.7222,-0.5335) [79.11%]
PC2  67140.50 (-0.8275, 0.0957, 0.5533) [19.65%]
PC3   4258.14 ( 0.3485,-0.6851, 0.6397) [ 1.25%]

The question is what causes i.pca, in this specific case, not to center the dataset?
Thanks, Nikos

Sources:
The data are available at: grass location with MODIS bands and MODIS bands as geotiff files More details in the archive: Testing i.pca (continued...) and in grass-wiki: Principal Component Analysis

Change History (4)

comment:1 in reply to:  description ; Changed 8 years ago by mmetz

Replying to nikos:

I have spotted one case where i.pca does not work as expected. I have a set of 3 MODIS surface reflectance bands. Performing PCA on those using i.pca does not center the data before the analysis, that is, the mean of each dimension (band) is not subtracted from the dimension itself to give a dataset that has zero mean which is an integral part of the solution to PCA.

[...]

  • Using the same data with the prcomp(x, center=TRUE, scale=FALSE) function in R, which centers the dataset by default anyway if not told otherwise, gives different results:
>                PC1         PC2        PC3
> mod07_b2 0.4372107  0.83099407 -0.3439413
> mod07_b6 0.7210155 -0.09527873  0.6863371
> mod07_b7 0.5375718 -0.54806096 -0.6408165

I get identical results with i.pca in trunk r49090 if I create a MASK first, thus excluding any NULL cells (some cells are NULL in one band but not NULL in another band).

# making use of NULL propagation
r.mapcalc "MASK = mod_b2 + mod_b6 + mod_b7 + 1"

i.pca gives now

Eigen values, (vectors), and [percent importance]:
PC1 778244.03 ( 0.4372, 0.7210, 0.5376) [79.20%]
PC2 192494.58 ( 0.8310,-0.0953,-0.5481) [19.59%]
PC3  11876.45 ( 0.3439,-0.6863, 0.6408) [ 1.21%]

which is identical to the above output of prcomp(x, center=TRUE, scale=FALSE)

Markus M

comment:2 in reply to:  1 Changed 8 years ago by mmetz

Replying to mmetz:

I get identical results with i.pca in trunk r49090 if I create a MASK first, thus excluding any NULL cells (some cells are NULL in one band but not NULL in another band).

Calculating a MASK in order to get results identical to prcomp(x, center=TRUE, scale=FALSE) is no longer needed as of r49092.

Markus M

comment:3 Changed 6 years ago by neteler

Milestone: 6.5.06.4.4

Has this been already backported to relbr6/develbr6?

Potentially: r51647 and r52890.

Please close the ticket if the issue got solved.

comment:4 Changed 6 years ago by nikos

Resolution: fixed
Status: newclosed

I confirm that this works fine now (GRASS 6.4.4svn (2014) / Revision: 50937), ie:

i.pca input=mod07_b2,mod07_b6,mod07_b7 rescale=0,0 out=pca_mod_b267_2014

gives

PC1 778244.03 (-0.4372,-0.7210,-0.5376) [79.20%]
PC2 192494.58 (-0.8310, 0.0953, 0.5481) [19.59%]
PC3  11876.45 ( 0.3439,-0.6863, 0.6408) [ 1.21%]

which is essentially identical to R's equivalent action via

prcomp(mod07_b267, center=TRUE, scale=FALSE)

which in turn gives

Standard deviations:
[1] 882.1814 438.7420 108.9791

Rotation:
               PC1         PC2        PC3
mod07_b2 0.4372107  0.83099407 -0.3439413
mod07_b6 0.7210155 -0.09527873  0.6863371
mod07_b7 0.5375718 -0.54806096 -0.6408165

Just for clarity/completeness, R reports: 1) column-wise and b) the "square roots of the eigenvalues", which in this case match GRASS' i.pca reported eigenvalues. Some R code:

> ev <- c ( 778244.03, 192494.58, 11876.45) # GRASS-reported eigenvalues
> sapply(ev, sqrt)
[1] 882.1814 438.7420 108.9791
Note: See TracTickets for help on using tickets.