Mlab - Rec_Summarize / Rec_GroupBy

Good morning –

Got a question for an mlab module guru.

After some experimentation (and judicious peeking at the source code), I
think I’ve got the hang of writing custom functions to pass into these
modules – basically, anything that accepts a list of values sliced from a
single column on the structured array and returns a single list seems to
work well. In functional programming terms, rec_summarize appears similar
to “map”, and rec_groupby appears similar to “reduce”.

Now – what if I want to derive a calculation from multiple statistics in
the original dataset – e.g., create a new column on the array which is
derived from 2 (or up to n) other fields in a custom function which I pass
into the process?

For example, conditional counts/summaries (count
transactions and sum the sales on all orders that weighed > 5K lbs).

Is there a way to do this within numpy or mlab without going
all the way out to python and creating a list comprehension?

Thanks.

John

There are a couple of ways with the existing functions.

One is to use a logical mask::

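   # keep only the rows for orders heavier than the threshold
   mask = r.weight > 5
   # groupby and stats are the usual rec_groupby arguments
   rg = mlab.rec_groupby(r[mask], groupby, stats)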
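
For reference, here is a minimal sketch of the same masking idea in plain NumPy, so that it runs without mlab. The field names (``region``, ``weight``, ``sales``) and the data are made up for illustration, and the per-group step uses ``np.unique`` plus ``np.bincount`` as a stand-in for rec_groupby rather than reproducing it:

```python
import numpy as np

# Hypothetical order data; field names and values are made up for illustration.
r = np.array(
    [("east", 7.0, 100.0),
     ("east", 3.0, 50.0),
     ("west", 9.0, 200.0),
     ("west", 6.0, 25.0),
     ("east", 8.0, 10.0)],
    dtype=[("region", "U8"), ("weight", "f8"), ("sales", "f8")],
)

# Boolean mask selecting the heavy orders (threshold of 5, as in the reply).
mask = r["weight"] > 5

# Overall conditional count and sum.
heavy_count = int(mask.sum())
heavy_sales = float(r["sales"][mask].sum())

# Per-group conditional count and sum, keyed on "region".
heavy_rows = r[mask]
regions, inverse = np.unique(heavy_rows["region"], return_inverse=True)
counts = np.bincount(inverse)
sales = np.bincount(inverse, weights=heavy_rows["sales"])

for region, n, total in zip(regions, counts, sales):
    print(region, n, total)
```

The mask is applied once, so every statistic computed afterwards sees only the rows that pass the condition.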

You could also create a new categorical variable with one or more
values, attach it to your record array, and then use rec_groupby::

   heavy = np.where(r.weight > 5, 1, 0)

and add that to your record array::

   r = mlab.rec_append_fields(r, ['heavy'], [heavy])

and then do a rec_groupby using 'heavy' as your group-by attribute.
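
A rough NumPy-only sketch of the categorical-variable approach follows. It assumes ``numpy.lib.recfunctions.append_fields`` in place of ``mlab.rec_append_fields``, and the data and the output field names (``n_orders``, ``sales_sum``) are hypothetical:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical order data, made up for illustration.
r = np.array(
    [(7.0, 100.0), (3.0, 50.0), (9.0, 200.0), (6.0, 25.0), (2.0, 10.0)],
    dtype=[("weight", "f8"), ("sales", "f8")],
)

# Categorical flag: 1 for heavy orders, 0 otherwise.
heavy = np.where(r["weight"] > 5, 1, 0)

# Attach the flag as a new field (usemask=False returns a plain ndarray).
r2 = rfn.append_fields(r, "heavy", heavy, usemask=False)

# Group on the flag: count the orders and sum the sales in each category.
keys, inverse = np.unique(r2["heavy"], return_inverse=True)
n_orders = np.bincount(inverse)
sales_sum = np.bincount(inverse, weights=r2["sales"])

summary = np.rec.fromarrays(
    [keys, n_orders, sales_sum], names="heavy,n_orders,sales_sum"
)
print(summary)
```

Compared with the mask, this keeps the light orders in the result as their own group, which is handy when both sides of the condition need reporting.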

Brian Schwartz has a preliminary implementation of rec_query which
allows you to make a SQL query on a record array by converting it to a
SQLite table, running the SQL query, and returning the results as a
new record array, which would solve your problem more cleanly and
generically. The code needs a little more polishing, but perhaps,
Brian, you can send over what you have in case John wants to take a
look.
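
To give a feel for the SQL idea, here is a small sketch using only the standard-library ``sqlite3`` module. This is not Brian's rec_query; ``query_records`` is a hypothetical helper, the table name ``r`` and the data are made up, and it assumes simple scalar fields and a query that returns at least one row:

```python
import sqlite3
import numpy as np


def query_records(r, sql, table="r"):
    """Hypothetical helper: copy a structured array into an in-memory
    SQLite table, run a query, and return the result as a record array."""
    names = r.dtype.names
    con = sqlite3.connect(":memory:")
    try:
        # Columns are declared without types; SQLite stores values as given.
        con.execute("CREATE TABLE {} ({})".format(table, ", ".join(names)))
        placeholders = ", ".join("?" for _ in names)
        con.executemany(
            "INSERT INTO {} VALUES ({})".format(table, placeholders),
            r.tolist(),  # list of tuples of plain Python scalars
        )
        cur = con.execute(sql)
        out_names = [d[0] for d in cur.description]
        rows = cur.fetchall()
    finally:
        con.close()
    return np.rec.fromrecords(rows, names=out_names)


# Hypothetical order data, made up for illustration.
r = np.array(
    [("east", 7.0, 100.0),
     ("east", 3.0, 50.0),
     ("west", 9.0, 200.0),
     ("west", 6.0, 25.0),
     ("east", 8.0, 10.0)],
    dtype=[("region", "U8"), ("weight", "f8"), ("sales", "f8")],
)

out = query_records(
    r,
    "SELECT region, COUNT(*) AS n_orders, SUM(sales) AS sales_sum "
    "FROM r WHERE weight > 5 GROUP BY region ORDER BY region",
)
print(out)
```

The column aliases in the query (``AS n_orders``, ``AS sales_sum``) become the field names of the returned record array, so it is worth aliasing every computed column.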

JDH

(In reply to the message from Hackett, John (Norcross, GA) <John.Hackett@...3654...>, sent Fri, Jul 1, 2011 at 11:14 AM.)