Gurus,

I am implementing a simple Principal Component Analysis (PCA) in Python, but I have run into trouble with the graphical output. I have calculated my scores and my loadings (just matrices with mean-centered, univariate values) and I want to show them in a scatter plot. To make the graph more useful, I also want to label each dot in the scatter plot and color it. I am using Matplotlib, Pylab, and SciPy.
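For context, here is a minimal sketch of how scores and loadings can be computed with an SVD. The PCA_svd routine used in this post is not shown, so the function name pca_svd_sketch, the n_components parameter, and the reading of the third return value E as a residual matrix are all assumptions made for illustration only.

```python
import numpy as np

def pca_svd_sketch(X, standardize=True, n_components=2):
    """Hypothetical stand-in for PCA_svd (assumed, not the poster's code).

    Returns scores T, loadings P, and residuals E (E as residuals is an assumption).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                # mean-center each column
    if standardize:
        Xc = Xc / Xc.std(axis=0, ddof=1)   # scale each column to unit variance
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]  # scores
    P = Vt[:n_components].T                     # loadings (orthonormal columns)
    E = Xc - T @ P.T                            # part of Xc not captured by T and P
    return T, P, E
```

With this sketch, T[:,0] and T[:,1] would be the first two score vectors to plot against each other.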

For example, given a 3x3 matrix of scores called T, I want to do something like this:

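from pylab import *  # scatter, legend, grid, show

# PCA_svd is my own routine; T holds the scores
T, P, E = PCA_svd( X, standardize = True )
t1, t2 = T[:,0], T[:,1]  # first two score vectors

# some_colors holds one color per point
properties = dict( alpha = 0.75, c = some_colors )
s1 = scatter( t1, t2, s = 50, **properties )

legend()
grid( True )
show()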

The result should show three dots of different colors, a legend describing each color, and a data label (say a two-character code, like AA, BB, CC) next to each data point.

I understand that the objects returned by pylab.scatter are not in a form that the pylab.legend command can use, and I was wondering whether a patch has been written for this yet. I use Python 2.5.3.

I have found one work-around for the legend: plot each group in its own color, then pass Rectangle objects of the same colors to legend as stand-ins, as follows:

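from matplotlib.patches import Rectangle

# p1, p2 are the loading vectors, the counterparts of t1, t2
props = dict( alpha = 0.75, faceted = False )
Scores = scatter( t1, t2, c = 'red', s = 50, **props )
Loadings = scatter( p1, p2, c = 'blue', s = 50, **props )

# The rectangles are never drawn; they only supply the legend swatches
redp = Rectangle( ( 0,0 ), 1, 1, facecolor = 'red' )
bluep = Rectangle( ( 0,0 ), 1, 1, facecolor = 'blue' )
legend( ( redp,bluep ),( 'Scores','Loadings' ) )

grid( True )
show()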

This works for giving two groups of points different colors, but it does not work for single data points (it raises "ValueError: First argument must be a sequence"), and it does not let me label each data point with a two-character code.
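To make the target concrete, here is a minimal sketch of the figure described above. It assumes a recent Matplotlib in which scatter accepts a label keyword that legend picks up, which may not hold for the version discussed in this post. The score values, codes, and colors are made-up placeholders, not real PCA output. Each point is passed as a one-element list, on the guess that the ValueError comes from passing bare scalars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder scores (three points, two components) and two-character codes.
T = np.array([[1.2, -0.4], [-0.7, 0.9], [-0.5, -0.5]])
codes = ["AA", "BB", "CC"]
colors = ["red", "green", "blue"]

fig, ax = plt.subplots()
for (x, y), code, color in zip(T, codes, colors):
    # One scatter call per point, so each point gets its own legend entry.
    ax.scatter([x], [y], s=50, c=color, alpha=0.75, label=code)
    # Text label placed 5 points up and to the right of the marker.
    ax.annotate(code, (x, y), xytext=(5, 5), textcoords="offset points")

ax.legend()
ax.grid(True)
fig.canvas.draw()  # render; use plt.show() instead in an interactive session
```

Looping over points keeps the legend and the labels in step with each other, at the cost of one scatter call per point, which is only practical for a small number of points.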

Any shoves in the right direction would be very much appreciated, especially links to online examples and source code.

-Timothy Kinney