Removing boxplot's complexity (mpl v3.0?)

Now that the distinction between Axes.bxp (drawer) and Axes.boxplot() is
well established, I think it's time that we start thinking about how the
high-level, easy-to-use, sane-default API of Axes.boxplot should look.

TL;DR version:
I am proposing that we:
* dramatically simplify Axes.boxplot
* move a small portion of that complexity to Axes.bxp
* move most of that complexity into user-defined functions by way of ...
* adding the option to pass a custom stats function to Axes.boxplot
* (optional) make very minor changes to cbook.boxplot_stats to allow users
to specify pre- and post-calculation transformations (e.g., np.log, np.exp)

I am committing to:
  * implementing all proposed changes
  * creating user docs outlining the new behavior with practical and complete
examples
  * maintaining these changes for the foreseeable future.

Full version:
I'm willing and quick to admit that boxplot's complexity was born from
needs that I was trying to address several years ago. I'm sorry for that.
However, I've learned a lot since I first started us down this path and I
think I see a way to get out of the woods.

Currently, boxplot's signature looks like this:

    def boxplot(self, x, notch=None, sym=None, vert=None, whis=None,
                positions=None, widths=None, patch_artist=None,
                bootstrap=None, usermedians=None, conf_intervals=None,
                meanline=None, showmeans=None, showcaps=None,
                showbox=None, showfliers=None, boxprops=None,
                labels=None, flierprops=None, medianprops=None,
                meanprops=None, capprops=None, whiskerprops=None,
                manage_xticks=True, autorange=False, zorder=None):

I think I can get it down to this:

    def boxplot(self, data, label=None, vert=False, whis=None,
                positions=None, widths=None, patch_artist=None,
                shownotch=False, showmeans=None, showcaps=None,
                showbox=None, showfliers=None, boxprops=None,
                flierprops=None, medianprops=None,
                meanprops=None, capprops=None, whiskerprops=None,
                manage_xticks=True, zorder=None, statfxn=None,
                **statfxn_kwargs):

It doesn't look like much, but dropping ``usermedians`` and
``conf_intervals`` alone would allow us to cut 25 SLOC from boxplot that
contain 9 separate if/else statements nested up to 5 levels deep with a for
loop, and raise up to 2 errors. It's worth noting that these parameters are
likely used by very few users, and those users are very advanced.

Similarly, the property kwargs (e.g., ``capprops``, ``medianprops``) and the
show kwargs (e.g., ``showbox``, ``showfliers``) can be passed directly to
and handled solely by Axes.bxp. That would remove ~75 LOC from Axes.boxplot
(with some amount being moved to Axes.bxp).

What would be added to Axes.boxplot to replace this functionality would be
a ``statfxn`` option and ``**statfxn_kwargs``.

``statfxn`` would default to ``cbook.boxplot_stats``, which is the current
behavior. But users could easily pass their own function and the necessary
options (e.g., ``bootstrap``) via ``**statfxn_kwargs``.
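
To make the ``statfxn`` idea a little more concrete, here is a minimal
sketch of the dispatch, not the actual implementation. The names
``boxplot_sketch`` and ``minmax_stats`` are hypothetical and exist only for
illustration. The sketch assumes the stats function returns a list of dicts
with the keys that Axes.bxp reads (``med``, ``q1``, ``q3``, ``whislo``,
``whishi``, ``fliers``, and optionally ``label``).

```python
import statistics


def minmax_stats(data, label=None):
    """Hypothetical custom stats function for one 1-D dataset.

    Whiskers span the full data range, so no points are flagged as fliers.
    """
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    entry = {
        "med": med,
        "q1": q1,
        "q3": q3,
        "whislo": min(data),
        "whishi": max(data),
        "fliers": [],
    }
    if label is not None:
        entry["label"] = label
    return [entry]


def boxplot_sketch(ax, data, statfxn=None, **statfxn_kwargs):
    """Hypothetical thin wrapper: compute the stats, then let bxp draw."""
    if statfxn is None:
        # Fall back to the existing helper, as the proposal suggests.
        from matplotlib import cbook
        statfxn = cbook.boxplot_stats
    stats = statfxn(data, **statfxn_kwargs)
    return ax.bxp(stats)
```

With a real Axes this would be called as
``boxplot_sketch(ax, data, statfxn=minmax_stats, label="a")``; everything
after ``statfxn`` is forwarded untouched to the stats function.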

The big win here is that letting users pass their own stats function to the
top-level boxplot means that users of libraries like seaborn will be able
to incorporate their own confidence interval estimates, data
transformations, etc. directly into the boxplot. It sounds minor, but see
how different boxplots of lognormal data look depending on which space you
compute the stats in:
https://github.com/mwaskom/seaborn/issues/432#issuecomment-71501177

With that in mind, I think we could cover ~90% of the use-cases that would
require a custom stats function if we allowed cbook.boxplot_stats to take
pre- and post-calculation transformation functions (e.g., np.log and np.exp
for the example above).
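
As a rough sketch of what pre- and post-calculation transformations could
look like, the wrapper below computes the stats on transformed data and maps
the results back. ``with_transforms`` and ``simple_stats`` are hypothetical
names, and ``math.log``/``math.exp`` stand in for np.log/np.exp so the
example needs only the standard library. Note that a real version would have
to decide what to do with derived values such as the IQR, which do not
back-transform meaningfully.

```python
import math
import statistics


def simple_stats(data):
    """Stand-in stats function (hypothetical); returns one bxp-style dict."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return [{
        "mean": statistics.fmean(data),
        "med": med,
        "q1": q1,
        "q3": q3,
        "whislo": min(data),
        "whishi": max(data),
        "fliers": [],
    }]


def with_transforms(statfxn, pre, post):
    """Wrap statfxn: compute stats on pre(data), map results back with post."""
    def wrapped(data, **kwargs):
        stats = statfxn([pre(x) for x in data], **kwargs)
        converted = []
        for entry in stats:
            new_entry = {}
            for key, value in entry.items():
                if key == "fliers":
                    new_entry[key] = [post(v) for v in value]
                elif isinstance(value, (int, float)):
                    new_entry[key] = post(value)
                else:
                    new_entry[key] = value  # e.g. a label string
            converted.append(new_entry)
        return converted
    return wrapped


data = [1.0, 10.0, 100.0, 1000.0, 10000.0]
linear = simple_stats(data)[0]
logspace = with_transforms(simple_stats, math.log, math.exp)(data)[0]

# The medians agree, but the mean computed in log space is the geometric
# mean, which is far smaller than the arithmetic mean for skewed data.
print(linear["mean"], logspace["mean"])
```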

All in all, I think these changes will be surprisingly minor but represent
a major improvement in maintainability. If there's interest, I'd be happy
to draft up the changes very shortly to make all of this a little more
concrete.

-Paul

p.s. An even more spartan approach would be to remove the complexity from
Axes.boxplot and tell advanced users that if they want to customize things,
they need to compute their own stats and use Axes.bxp.

This sounds reasonable, but I think we can do this over the course of 3
minor releases (1 release warning kwargs are going away, 1 raising custom
error, then gone) given the limited scope of the API change.
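
One way to picture the first two stages of such a cycle is the sketch
below. It is only an illustration: ``check_deprecated_kwargs`` and its
``stage`` argument are hypothetical, and it uses the standard library's
``warnings`` module rather than whatever deprecation helper matplotlib
would actually use. The third stage is simply removing the parameters from
the signature.

```python
import warnings

# Parameters the proposal suggests dropping from Axes.boxplot.
_DEPRECATED = ("usermedians", "conf_intervals")


def check_deprecated_kwargs(stage="warn", **kwargs):
    """Hypothetical helper: warn or raise if a deprecated kwarg is used.

    stage="warn"  models the first release (deprecation warning only).
    stage="error" models the second release (custom error).
    """
    used = [name for name in _DEPRECATED if kwargs.get(name) is not None]
    if not used:
        return
    msg = (
        "boxplot parameter(s) {} are going away; compute the stats with a "
        "custom stats function instead".format(", ".join(used))
    )
    if stage == "warn":
        warnings.warn(msg, DeprecationWarning, stacklevel=2)
    elif stage == "error":
        raise TypeError(msg)
    else:
        raise ValueError("unknown stage: {!r}".format(stage))
```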

Could you also put the text of this email in a MEP (
https://github.com/matplotlib/matplotlib/tree/master/doc/devel/MEP )?

Tom

On Wed, Oct 12, 2016 at 4:33 PM Paul Hobson <pmhobson at gmail.com> wrote:

_______________________________________________
Matplotlib-devel mailing list
Matplotlib-devel at python.org
https://mail.python.org/mailman/listinfo/matplotlib-devel