Using matplotlib's prctile on masked arrays

Gokhan_SEVER1 · October 27, 2009, 11:56am

Hello,

Consider this sample two columns of data:

999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 1693.9069
999999.9999 1676.1059
999999.9999 1621.5875
651.8040 1542.1373
691.0138 1650.4214
678.5558 1710.7311
621.5777 999999.9999
644.8341 999999.9999
696.2080 999999.9999

Put this data into a file, say "sample.data", and load it with:

a, b = np.loadtxt('sample.data', dtype="float").T

I[16]: a
O[16]:
array([ 1.00000000e+06, 1.00000000e+06, 1.00000000e+06,
1.00000000e+06, 1.00000000e+06, 1.00000000e+06,
6.51804000e+02, 6.91013800e+02, 6.78555800e+02,
6.21577700e+02, 6.44834100e+02, 6.96208000e+02])

I[17]: b
O[17]:
array([ 999999.9999, 999999.9999, 999999.9999, 1693.9069,
1676.1059, 1621.5875, 1542.1373, 1650.4214,
1710.7311, 999999.9999, 999999.9999, 999999.9999])

Interestingly, the second column is displayed as-is, but the values of a are shown in a slightly different form. Why could this be happening? Any ideas? Anyway, back to masked arrays:

I[24]: am = ma.masked_values(a, value=999999.9999)

I[25]: am
O[25]:
masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777 644.8341 696.208],
mask = [ True True True True True True False False False False False False],
fill_value = 999999.9999)

I[30]: bm = ma.masked_values(b, value=999999.9999)

I[31]: am
O[31]:
masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777 644.8341 696.208],
mask = [ True True True True True True False False False False False False],
fill_value = 999999.9999)
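
On the earlier question of why a prints differently from b: this is most likely only a display effect. As far as I know, NumPy switches a whole array to scientific notation when the ratio of its largest to smallest absolute value exceeds roughly 1000, which holds for a but not for b, and the default 8-digit precision then rounds 999999.9999 to 1.00000000e+06. A minimal sketch that checks the stored values are untouched (the inline text stands in for sample.data, and the 1000 threshold is my reading of NumPy's printing logic, not something confirmed in this thread):

```python
import io

import numpy as np
import numpy.ma as ma

# Inline stand-in for the contents of sample.data.
text = """\
999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 1693.9069
999999.9999 1676.1059
999999.9999 1621.5875
651.8040 1542.1373
691.0138 1650.4214
678.5558 1710.7311
621.5777 999999.9999
644.8341 999999.9999
696.2080 999999.9999
"""
a, b = np.loadtxt(io.StringIO(text), dtype=float).T

# The stored value is exactly the parsed float; only the printed form differs.
print(a[0] == 999999.9999)

# Dynamic range of each column (assumed to drive the choice of notation).
print(a.max() / a.min(), b.max() / b.min())

# masked_values compares with a tolerance, so the sentinel is still matched.
am = ma.masked_values(a, value=999999.9999)
print(am.count())  # number of unmasked entries
```

Calling np.set_printoptions(suppress=True) before printing should also make a display in fixed-point notation.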

So far so good. A few basic checks:

I[33]: am/bm
O[33]:
masked_array(data = [-- -- -- -- -- -- 0.422662755126 0.418689311712 0.39664667346 -- -- --],
mask = [ True True True True True True False False False True True True],
fill_value = 999999.9999)

I[34]: mean(am/bm)
O[34]: 0.41266624676580849

Unfortunately, matplotlib.mlab's prctile does not honor the mask on this ratio:

I[54]: prctile(am/bm, p=[5,25,50,75,95])
O[54]:
array([ 3.96646673e-01, 6.21577700e+02, 1.00000000e+06,
1.00000000e+06, 1.00000000e+06])

This also results in wrong-looking box-and-whisker plots.
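
A possible workaround, sketched under the assumption that only the unmasked ratios should enter the statistics: drop the masked entries with compressed() before handing the data to a percentile function that does not know about masks. np.percentile is used here as a stand-in; it is not the function from the session above.

```python
import numpy as np
import numpy.ma as ma

SENTINEL = 999999.9999  # missing-value marker used in sample.data

a = np.array([SENTINEL] * 6
             + [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080])
b = np.array([SENTINEL] * 3
             + [1693.9069, 1676.1059, 1621.5875, 1542.1373, 1650.4214, 1710.7311]
             + [SENTINEL] * 3)

am = ma.masked_values(a, value=SENTINEL)
bm = ma.masked_values(b, value=SENTINEL)
ratio = am / bm  # masked wherever either operand is masked

# compressed() returns a plain 1-D ndarray holding only the unmasked values.
valid = ratio.compressed()

# Percentiles computed on the valid values only.
pcts = np.percentile(valid, [5, 25, 50, 75, 95])
print(valid)
print(pcts)
```

The same compressed array can be passed to a box plot call. Note that np.percentile interpolates linearly, so its values may differ slightly from those of other quantile definitions.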

Testing further with scipy.stats functions yields expected correct results:

I[55]: stats.scoreatpercentile(am/bm, per=5)
O[55]: 0.40877012449846228

I[49]: stats.scoreatpercentile(am/bm, per=25)
O[49]:
masked_array(data = --,
mask = True,
fill_value = 1e+20)

I[56]: stats.scoreatpercentile(am/bm, per=95)
O[56]:
masked_array(data = --,
mask = True,
fill_value = 1e+20)

Any confirmation?

–
Gökhan

system · October 27, 2009, 1:25pm

Testing further with scipy.stats functions yields expected correct results:

These should not be the correct results if you use scipy.stats.scoreatpercentile:
it does not handle missing values correctly, and it treats NaNs or
masked/fill values as regular numbers sorted to the end.

stats.mstats.scoreatpercentile is the corresponding function for
masked arrays.

(BTW I wasn't able to quickly copy and paste your example because
MaskedArrays don't seem to have a constructive __repr__, i.e.
no commas)
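
One way to share a masked array in a form that pastes back cleanly, as a sketch only: convert the data and the mask to plain Python lists, whose reprs are comma-separated. The short three-element array below is made up for illustration.

```python
import numpy.ma as ma

# Illustrative array: the first entry is the missing-value sentinel.
am = ma.masked_values([999999.9999, 651.804, 691.0138], value=999999.9999)

# filled() substitutes the fill value for masked entries;
# getmaskarray() always returns a full boolean array.
data = am.filled().tolist()
mask = ma.getmaskarray(am).tolist()

# This printed line can be pasted straight into another session.
print("ma.array(%r, mask=%r)" % (data, mask))

# Rebuilding from the two lists gives back an equivalent masked array.
rebuilt = ma.array(data, mask=mask)
print(rebuilt.count())
```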

I don't know anything about the matplotlib story.

Josef

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@...177...
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Gokhan_SEVER1 · October 28, 2009, 1:47pm

This should not be the correct results if you use scipy.stats.scoreatpercentile, it doesn’t have correct missing value handling, it treats nans or mask/fill values as regular numbers sorted to the end.

stats.mstats.scoreatpercentile is the corresponding function for masked arrays.

Thanks for the suggestion. I had forgotten that this module exists. It yields better results.

I[14]: st.mstats.scoreatpercentile(r, per=25)
O[14]:
masked_array(data = 0.401055201111,
mask = False,
fill_value = 1e+20)

I[17]: st.scoreatpercentile(r, per=25)
O[17]:
masked_array(data = --,
mask = True,
fill_value = 1e+20)

I usually fall into traps when using masked arrays. Hopefully I will figure these out before I make funnier mistakes in my analysis.

Besides, it would be nice if the "per" argument accepted a sequence instead of a single item, like matplotlib's prctile, so it could be used as …(array, per=[5,25,50,75,95]) in one call.
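
One way to get several percentiles in a single call, as a sketch: scipy.stats.mstats.mquantiles takes a sequence of probabilities (fractions between 0 and 1, not percentages) and skips masked entries. The assumption here is that r in the session above is the masked ratio am/bm; its definition is not shown in the thread.

```python
import numpy.ma as ma
from scipy.stats import mstats

SENTINEL = 999999.9999  # missing-value marker used in sample.data

am = ma.masked_values(
    [SENTINEL] * 6 + [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080],
    value=SENTINEL)
bm = ma.masked_values(
    [SENTINEL] * 3
    + [1693.9069, 1676.1059, 1621.5875, 1542.1373, 1650.4214, 1710.7311]
    + [SENTINEL] * 3,
    value=SENTINEL)

r = am / bm  # assumed definition of r

# Probabilities are fractions, so the 25th percentile is prob=0.25.
q = mstats.mquantiles(r, prob=[0.05, 0.25, 0.50, 0.75, 0.95])
print(q)
```

With the default plotting-position parameters, the second entry should agree with the per=25 value from mstats.scoreatpercentile shown above.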

(BTW I wasn’t able to quickly copy and paste your example because MaskedArrays don’t seem to have a constructive __repr__, i.e. no commas)

You can copy and paste the sample data from this link. Copying it from a text file into Gmail somehow distorted the original layout of the data.

http://code.google.com/p/ccnworks/source/browse/trunk/sample.data

–
Gökhan