Improving feature discoverability by centralizing documentation of "implicit" types

brunobeltran · June 8, 2020, 8:24pm

Abstract

Matplotlib is the de facto standard library atop which the bulk of the Python visualization ecosystem is built, including such popular packages as cartopy, pandas, seaborn, and many others.

Over the years, Matplotlib has evolved to contain a stunning array of features; however, discoverability remains a central problem for our documentation.
There is a vast existing store of matplotlib documentation that can be broken
up largely into tutorials, API documentation, and examples (ignoring “meta” documentation, e.g. installation how-tos, contribution guides, etc.).
However, while anything not covered in detail with the API documentation is likely to have a tutorial or example written about it, it seems that even amongst core developers the best way to figure out the “latest and greatest” way to accomplish a task is often to ask the community.

In order to remedy this problem, I propose to integrate the API documentation with a large swath of existing tutorials and examples by attempting to systematically document what I will call Matplotlib’s “implicit types”.
These are a set of core concepts which, due to their “duck type”-heavy nature, have not yet merited their own proper class, but nonetheless are commonly accepted as inputs to a variety of Matplotlib routines internally, and whose documentation is currently scattered throughout docstrings, tutorials, and sometimes even just in examples.

The goal is to create a path to discoverability of Matplotlib’s core features for one of our largest user groups: the so-called “copy/paste/edit” developer (see the User research section below).
We will leverage our existing Sphinx/Readthedocs infrastructure by simply creating a new reStructuredText role :type: which can be used in place of the usual numpydoc type syntax when appropriate.
For example, in Line2D:

"""
color : :type:`Color`, default: :rc:`lines.color`
"""

would link to a comprehensive “Color” tutorial (currently links to
Line2D.set_color).

The goals of this change are to

Make it easy to find the “common”/“easiest” option for a parameter (preferably in zero clicks).
Make it easy to “scroll down” to see more advanced options (preferably with one click, max).
Provide a centralized strategy for linking top-level “API” docs to the relevant “tutorials”.
Avoid API-doc-explosion, where scanning through the many possible options for each parameter makes individual docstrings unwieldy.

User research

Because matplotlib users are such a diverse group, creating documentation that
works for everyone—from those for whom matplotlib is their introduction to
Python to seasoned data scientists and library writers hoping to wrap
matplotlib into their own packages—is a herculean task. In order to identify
what unmet needs are most relevant for different kinds of users, I watched
several matplotlib users undertake assigned tasks using the library, where the
users were split into three different categories of experience with matplotlib
and Python:

1. Experienced with matplotlib and Python
2. Experienced with Python, but not plotting or matplotlib specifically
3. Experienced with design, but with neither Python nor matplotlib

TODO: Fill in this section with full write-up.

The main takeaway was that both types 1 and 2 are “copy/paste/edit” developers
who do the majority of their editing with the docs pulled up, and only discover
features by googling in plain English when they have a specific task they want
to accomplish that they don’t immediately find in the API docs (or in the docs
for the example they are copying from).

Detailed proposal

Historically, matplotlib’s API has relied heavily on string-as-enum
“implicit types”. Besides mimicking MATLAB’s API, these parameter-strings allow the
user to pass semantically-rich values as arguments to matplotlib functions
without having to explicitly import or verbosely prefix an actual enum value
just to pass basic plot options (e.g. plt.plot(x, y, linestyle='solid') is
easier to type and less redundant than plt.plot(x, y, linestyle=mpl.LineStyle.solid)).

Many of these string-as-enum implicit types have since evolved more sophisticated
features. For example, a linestyle can now be either a string or a 2-tuple
of an offset and an on-off dash sequence, and a MarkerStyle can now be either a
string or a path. While this
is true of many implicit types, MarkerStyle is the only one (to my knowledge) that
has the status of being a proper Python type.

Because these implicit types are not classes in their own right, Matplotlib has
historically had to roll its own solutions for centralizing documentation and
validation of these implicit types (e.g. the docstring.interpd.update docstring
interpolation pattern and the cbook._check_in_list validator pattern,
respectively) instead of using the standard toolchains.
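
To illustrate the validator pattern mentioned above, here is a simplified stand-in written for this post. It is not the actual cbook._check_in_list implementation; the function name `check_in_list`, its signature, and the error message are assumptions.

```python
def check_in_list(allowed, **kwargs):
    """Raise ValueError if any keyword value is not in *allowed*.

    Simplified stand-in for the validator pattern; not Matplotlib's code.
    """
    for name, value in kwargs.items():
        if value not in allowed:
            supported = ", ".join(map(repr, allowed))
            raise ValueError(
                f"{value!r} is not a valid value for {name}; "
                f"supported values are {supported}")


def set_capstyle(capstyle):
    # Every function that accepts a capstyle has to repeat both the
    # list of allowed strings and the validation call.
    check_in_list(("butt", "round", "projecting"), capstyle=capstyle)
    return capstyle
```

Because the allowed values live at each call site rather than on a type, there is no single object whose documentation could describe them.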

While these solutions have worked well for us, the lack of an explicit location
to document each implicit type means that the documentation is often difficult to
find, large tables of allowed values are repeated throughout the documentation,
and often an explicit statement of the scope of an implicit type is completely
missing from the docs. Take the plt.plot docs, for example. In the “Notes”,
a description of the matlab-like format-string styling method mentions
linestyle, color, and markers options. There are many more ways to
pass these three values than are hinted at, but, for many users, this is their
only source of understanding about what values are possible for those options
until they stumble on one of the relevant tutorials. In the table of Line2D
attributes, the linestyle entry does a good job of linking to
Line2D.set_linestyle where those options are described, but the color
and markers entries do not. color simply links to Line2D.set_color,
which does nothing in the way of offering intuition on what kinds of inputs are
allowed.

… It can be argued that plt.plot is a good candidate to be explicitly
exempted from any documentation best practices we try to codify, and I’ve
chosen it intentionally to elicit the strongest opinions from everyone.

It could be argued that this is something that can be fixed by simply tidying
up the individual docstrings that are causing problems, but the issue is
unfortunately much more systemic than that. Without a centralized place to find
the documentation, this will simply lead to us having more and more copies of
increasingly verbose documentation repeated everywhere each of these implicit
types is used. The alternative, of scattering the information throughout the
documentation, will instead lead to the users having to slowly piece together
their mental model of each implicit type through wiki-diving style traversal
throughout our documentation, or piecemeal from StackOverflow examples.

Ideally, a mention of linestyle in the LineCollection docs should
instead link to the same place as it does in the plt.plot docs. By
organizing these linestyle-specific docs in order from most-common to
most-complex input types, we can maintain a “single-click-to-discover” property
for our advanced plotting options, while also making sure that we don’t hurt
usability for users that simply want to know the simplest way to accomplish a
common task.

Practically speaking, the actual information that we want to have in the
LineCollection docs is just:

A link to complete docs for allowable inputs (like those found in
Line2D.set_linestyle).
A plain words description of what the parameter is meant to accomplish. To
matplotlib power users, this is evident from the parameter’s name, but for
new users this need not be the case. (e.g. linestyle: a description of whether the stroke used to draw each line in the collection is dashed, dotted or solid).
A link to any tutorials that visually depict the possible options (currently
found only after already clicking through to the Line2D.set_linestyle
docs).

In order to make this information available for all implicit types, helping the
continued improvement of the consistency and readability of the docs, we propose
the following best-practices for handling implicit types:

1. Implicit type documentation should be centralized at a dedicated page, where
   the easiest/most common/simplest options are plainly documented in a
   separate section on the top of the page, and more advanced options can be
   found by simply scrolling down.
2. Functions that accept implicit types as parameters should link to the
   appropriate :type: docs.
3. If an implicit type is a “string-as-enum”, it should simply be made an
   Enum, and each possible value should have a Sphinx-parseable documentation
   string.

In particular, notice that (1) would replace large copies of tables of possible
linestyles, markerstyles, etc, with links to the complete documentation for
each. Without all the visual noise from these tables of valid options, the
relevant functions would be free to visibly link to tutorials where these
options are visually demonstrated.
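
As a rough sketch of the string-as-enum practice listed above, a joinstyle could be spelled as a standard-library Enum whose members carry `#:` comments, which Sphinx autodoc can pick up as per-member documentation. The class below is hypothetical and written only for illustration; the member descriptions are my own wording.

```python
from enum import Enum


class JoinStyle(Enum):
    """How two connected line segments are joined (hypothetical sketch)."""

    #: Extend the outer edges until they meet in a sharp corner.
    miter = "miter"
    #: Round off the corner with a circular arc.
    round = "round"
    #: Cut the corner off with a straight edge.
    bevel = "bevel"


# Lookup by value turns the familiar string into a member and rejects typos
# with a ValueError, so callers can keep passing plain strings.
style = JoinStyle("bevel")
```

Users would not need to import the Enum themselves; a function could accept the string and convert it internally.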

The way this would look in the actual docs is just

"""
linestyles: :type:`Linestyle` or list thereof, default: :rc:`lines.linestyle`
"""

This would link to a comprehensive explanation of what “linestyles” are allowable,
similar to what is currently found at
:doc:`/gallery/lines_bars_and_markers/linestyles.html` (the entry currently does not link to anything at all!).

Some benefits of this approach include:

Less likely for docs to become stale, due to centralization.
Increased discoverability of advanced options. If the simple linestyle option
'-' is documented alongside more complex on-off dash specifications,
users are more likely to scroll down than they are to stumble across an
unlinked-to tutorial that describes a feature they need.
Canonicalization of many of matplotlib’s “implicit standards” (like what is a
“bounds” versus an “extents”) that currently have to be learned by reading
the code.
The process would likely highlight issues with API consistency in a way that
could be more easily tracked via Issues, helping with the process of
improving our API (see below for discussion).
Becoming more compatible with potentially adding typing to the library.
Faster doc build times, due to significant decreases in the amount of
text needing to be parsed.

Implementation

This proposal would create one centralized “tutorial” page per implicit type.
For types with complex construction requirements, we would produce and use
classmethods for explicit construction from a known type, but __init__
would continue to hold the logic required to deduce how to construct the type
from the type of the input.

All functions that accept this implicit type as a parameter would have their
docstrings changed to simply use the numpydoc “input type” syntax to link to
this new class. All functions which use this implicit type (i.e. would raise on
an invalid input) would construct an explicit object instance using the general
__init__, allowing the new class to handle validation.
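
A minimal sketch of the construction pattern described above, using linestyle as the example. The class, its attribute names, and the validation rules are assumptions made for illustration (short spellings such as '-' are left out for brevity); this is not an existing Matplotlib class.

```python
class LineStyle:
    """Hypothetical explicit class for the linestyle implicit type."""

    NAMES = ("solid", "dashed", "dashdot", "dotted")

    def __init__(self, spec):
        # General entry point: deduce the construction path from the input type.
        if isinstance(spec, str):
            built = LineStyle.from_name(spec)
        else:
            offset, dashes = spec
            built = LineStyle.from_dashes(offset, dashes)
        self.name, self.offset, self.dashes = (
            built.name, built.offset, built.dashes)

    @classmethod
    def _make(cls, name, offset, dashes):
        # Bypass __init__ so the explicit constructors skip the dispatch step.
        self = object.__new__(cls)
        self.name, self.offset, self.dashes = name, offset, dashes
        return self

    @classmethod
    def from_name(cls, name):
        """Explicit construction from a known style name."""
        if name not in cls.NAMES:
            raise ValueError(f"{name!r} is not one of {cls.NAMES}")
        return cls._make(name, 0.0, None)

    @classmethod
    def from_dashes(cls, offset, dashes):
        """Explicit construction from an offset and an on-off dash sequence."""
        dashes = tuple(float(d) for d in dashes)
        if len(dashes) % 2 or any(d < 0 for d in dashes):
            raise ValueError(
                "dashes must be an even-length sequence of non-negative lengths")
        return cls._make(None, float(offset), dashes)
```

A function accepting a linestyle would then call `LineStyle(value)` once and let the class raise on invalid input, for example `LineStyle("dashed")` or `LineStyle((0, (5, 2)))`.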

The implicit types that I propose require better organized tutorials

capstyle
joinstyle
bounds
extents
linestyle
colors
colornorm/colormap
ticks
Probably others…

Related issues

Some common discoverability issues that this proposal does not address involve
parameters whose allowable types depend on other parameters (for example
x and y in plt.plot depending on data).
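
For readers unfamiliar with this case: when a `data` argument is passed to plt.plot, x and y may be given as strings that are looked up in `data`. The function below is a simplified stand-in for that lookup, written only to show why the allowable type of one parameter depends on another. The name `resolve` is hypothetical and this is not Matplotlib's implementation.

```python
def resolve(arg, data=None):
    """Return data[arg] when *arg* is a key into *data*, else *arg* itself."""
    if data is not None and isinstance(arg, str) and arg in data:
        return data[arg]
    return arg


table = {"t": [0, 1, 2], "v": [0.0, 0.5, 2.0]}

x = resolve("t", data=table)        # string is accepted because data is given
y = resolve([3, 4, 5], data=table)  # array-likes pass through unchanged
label = resolve("t")                # without data, the string stays a string
```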

Alternatives

I submitted a similar proposal as MEP30, which instead of simply adding
documentation, actually proposes to make each of these concepts into a new
style class, so that much of what is effectively tutorial documentation
becomes just API documentation, which can be linked using the
standard numpydoc conventions for types of parameters.

Timeline

//TODO

brunobeltran · June 8, 2020, 8:29pm

This proposal was originally submitted as MEP30, but @tacaswell and @story645 suggested I submit it as a GSOD proposal here.

I created this now so that I can start to get feedback, but will continue to edit the original post slightly as I finish writing up my user research and incorporating people’s suggestions to how to do MEP30 better. My hope is that after incorporating feedback, the original post here will be a fully-fledged GSOD proposal, ready to submit.

I am happy to mentor anybody for whom this proposal sounds interesting, but otherwise I will be doing this work anyway, so it would be great to get paid for it.

For writing samples please see any of the papers I have published in my CV:
http://brunobeltran/files/brunocv.pdf

In particular:
https://link.aps.org/doi/10.1103/PhysRevLett.123.208103

Or any of my major PRs to matplotlib, including:

everything linked to from #16891
my take on the Collections docs (#17575), which inspired this proposal

tacaswell · July 2, 2020, 7:52pm

@brunobeltran I think this is a strong proposal; please make sure to finish submitting it to Google before the deadline.

I have some slight concerns about the exact proposed spelling of the :type: Sphinx role and that the documentation work will get side-tracked fixing the inevitable mismatches we are going to find.

brunobeltran · July 6, 2020, 6:52pm

Thanks @tacaswell, I’ll include some language about limiting the scope of the proposal to actually defining these matplotlib “concepts”. While writing up the documentation (based on reading through the code so far), it will usually be easy enough to just open an issue whenever I find a mismatch, instead of trying to fix it completely.

I am definitely still open to suggestions RE: the spelling of the :type: Sphinx role (after giving it more thought, maybe having it be called :concept: would make more sense?). I’ll bring it up in today’s call.

I’ll be filling in the user research sections today and should have plenty of time to submit by the deadline Wed!

brunobeltran · July 9, 2020, 2:45pm

Per the dev call on Monday, I revised the above proposal to make sure the focus is clearly on collating the documentation for our “implicit types”, and only then if time allows following through on MEP30 and incorporating this documentation into proper Python class docstrings (where the classes can then hold the validator and potentially serializer logic).

Since there appears to be a cap on how long after a post it can be edited, I will paste my final application proposal below instead of editing the above post (which I no longer have “permission” to do, apparently…?).

Motivation

Historically, matplotlib’s API has relied heavily on string-as-enum
“implicit types”. Besides mimicking MATLAB’s API, these parameter-strings allow the
user to pass semantically-rich values as arguments to matplotlib functions
without having to explicitly import or verbosely prefix an actual enum value
just to pass basic plot options (e.g. plt.plot(x, y, linestyle='solid') is
easier to type and less redundant than something like plt.plot(x, y, linestyle=mpl.LineStyle.solid)).

Many of these string-as-enum implicit types have since evolved more
sophisticated features. For example, a linestyle can now be either a string
or a 2-tuple of an offset and an on-off dash sequence, and a MarkerStyle can
now be either a string or a
matplotlib.path.Path. While this is true of many implicit types, MarkerStyle
is the only one (to my knowledge) that has the status of having been upgraded to
a proper Python class.

Because these implicit types are not classes in their own right, Matplotlib has
historically had to roll its own solutions for centralizing documentation and
validation of these implicit types (e.g. the docstring.interpd.update docstring
interpolation pattern and the cbook._check_in_list validator pattern,
respectively) instead of using the standard toolchains provided by Python
classes (e.g. docstrings and the validate-at-__init__ pattern,
respectively).
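
For context, the docstring interpolation pattern works roughly as sketched below: shared snippets are registered once and substituted into `%(name)s` placeholders in individual docstrings. This is a simplified stand-in written for this post, not the actual docstring.interpd code; the names `update`, `interpolate`, and `linestyle_doc` are assumptions.

```python
import inspect

# Shared snippets keyed by name (stand-in for a central registry).
_SNIPPETS = {}


def update(**kwargs):
    """Register shared documentation snippets."""
    _SNIPPETS.update(kwargs)


def interpolate(func):
    """Decorator: fill %(name)s placeholders in the function's docstring."""
    if func.__doc__:
        func.__doc__ = inspect.cleandoc(func.__doc__) % _SNIPPETS
    return func


update(linestyle_doc="{'-', '--', '-.', ':'} or (offset, on-off-seq)")


@interpolate
def set_linestyle(ls):
    """
    Set the linestyle.

    Parameters
    ----------
    ls : %(linestyle_doc)s
    """
    return ls
```

The snippet text ends up copied into every docstring that uses it, which is how the same table of values gets repeated throughout the rendered docs.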

While these solutions have worked well for us, the lack of an explicit location
to document each implicit type means that the documentation is often difficult
to find, large tables of allowed values are repeated throughout the
documentation, and often an explicit statement of the scope of an implicit
type is completely missing from the docs. Take the plt.plot docs, for
example: in the “Notes”, a description of the matlab-like format-string styling
method mentions linestyle, color, and markers options. There are
many more ways to pass these three values than are hinted at, but for many
users, this is their only source of understanding about what values are possible
for those options until they stumble on one of the relevant tutorials. A
table of Line2D attributes is included in an attempt to show the reader what
options they have for controlling their plot. However, while the linestyle
entry does a good job of linking to Line2D.set_linestyle (two clicks
required) where the possible inputs are described, the color and markers
entries do not. color simply links to Line2D.set_color, which fails to
offer any intuition for what kinds of inputs are even allowed.

It could be argued that this is something that can be fixed by simply tidying up
the individual docstrings that are causing problems, but the issue is
unfortunately much more systemic than that. Without a centralized place to find
the documentation, this will simply lead to us having more and more copies of
increasingly verbose documentation repeated everywhere each of these implicit
types is used, making it even more difficult for beginner users to simply
find the parameter that they need. However, the current system, which forces
users to slowly piece together their mental model of each implicit type through
wiki-diving style traversal throughout our documentation, or piecemeal from
StackOverflow examples, is also not sustainable.

End Goal

Ideally, any mention of an implicit type should link to a single page that
describes all the possible values that type can take, ordered from most simple
and common to most advanced or esoteric. Instead of using valuable visual
space in the top-level API documentation to piecemeal enumerate all the possible
input types to a particular parameter, we can then use that same space to give a
plain-word description of what plotting abstraction the parameter is meant to
control.

To use the example of linestyle again, what we would want in the
LineCollection docs is just:

A link to complete docs for allowable inputs (a combination of those found in
Line2D.set_linestyle and the linestyle
tutorial).
A plain words description of what the parameter is meant to accomplish. To
matplotlib power users, this is evident from the parameter’s name, but for
new users this need not be the case.

The way this would look in the actual LineCollection docs is just

"""
linestyles: `LineStyle` or list thereof, default: :rc:`lines.linestyle` ('-')
    A description of whether the stroke used to draw each line in the collection
    is dashed, dotted or solid, or some combination thereof.
"""

where the LineStyle type reference would be resolved by Sphinx to point
towards a single, authoritative, and complete set of documentation for how
Matplotlib treats linestyles.

Benefits

Some powerful features of this approach include:

Making the complete extent of what each function is capable of obvious in
plain text (with zero clicks required).
Making the default option visible (with zero clicks). Seeing the default option
is often enough to jog the memory of returning users.
Making a complete description of the “most common” and “easiest” options for a
parameter easily available when browsing (with a single click).
Making the process of discovering more powerful features and input methods as
easy as scrolling down to see more advanced options (with still only one
click).
Providing a centralized strategy for linking top-level “API” docs to the relevant “tutorials”.
Avoiding API-doc-explosion, where scanning through the many possible options for each parameter makes individual docstrings unwieldy.

Other benefits of this approach over the current docs are:

Docs are less likely to become stale, due to centralization.
Canonicalization of many of matplotlib’s “implicit standards” (like what is a
“bounds” versus an “extents”) that currently have to be learned by reading
the code.
The process would highlight issues with API consistency in a way that
can be more easily tracked via the GitHub issue tracker, helping with the
process of improving our API.
Faster doc build times, due to significant decreases in the amount of
text needing to be parsed.

Implementation

The improvements described above will require two major efforts for which a
dedicated technical writer will be invaluable. The first is to create one
centralized “tutorial” page per implicit type. This will require working with
the core developer team to identify a concrete list of implicit types whose
documentation would be valuable to users (typically, because they contain
powerful, hidden features of our library whose documentation is currently only
found in difficult-to-stumble-across tutorials). For each implicit type, I will
then synthesize the various relevant tutorials, API docs, and example pages into
a single authoritative source of documentation that can be linked to anywhere
that particular type is referenced.

Once the centralized documentation for a given implicit type is complete, the
second major effort begins: replacing existing API documentation with links to
the new documentation, with an eye towards making the experience of actually
using this new documentation as easy as possible, both for those using Python’s
built-in help() utility and for those browsing our documentation online.
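
One reason the help() case needs attention: Sphinx roles are not rendered there, so any link appears as raw markup and the plain-word description has to carry the meaning. A small sketch, using a hypothetical function and the docstring style proposed in this post:

```python
import inspect


def set_linestyles(linestyles):
    """
    Set the linestyles (hypothetical function for illustration).

    Parameters
    ----------
    linestyles : LineStyle or list thereof, default: '-'
        Whether the stroke used to draw each line is dashed, dotted or
        solid. For the full list of accepted values, see
        :doc:`tutorials/types/linestyle`.
    """
    return linestyles


# inspect.getdoc returns the de-indented docstring text; the :doc: role
# stays as literal markup because nothing outside Sphinx resolves it.
text = inspect.getdoc(set_linestyles)
print(text)
```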

While the exact format of the documentation proposed here is subject to change
as this project evolves, I have worked with the Matplotlib core team during
their weekly “dev calls” to establish a consensus that the strategy proposed
here is the most expedient, useful, and technically tractable approach to begin
documenting these “implicit types” (notes on these
calls are available on hackmd).
I will use the existing “tutorials” infrastructure for the initial stages of
creating the centralized documentation for each implicit type, allowing me to
easily reference these pages as follows, without having to create any new public
classes (again, using the LineCollection docs as an example):

"""
linestyles: LineStyle or list thereof, default: :rc:`lines.linestyle` ('-')
    A description of whether the stroke used to draw each line in the collection
    is dashed, dotted or solid, or some combination thereof. For a full
    description of possible LineStyles, see :doc:`tutorials/types/linestyle`.
"""

Moving forward, we could then easily change how these references are spelled
once the core developer team agrees on the best long-term strategy for
incorporating our new “types” documentation into bona fide Python classes, for
example as proposed in the Matplotlib Enhancement Proposal
30.

Finally, the preliminary list of implicit types that I propose documenting
during this Google Season of Docs are:

capstyle
joinstyle
bounds
extents
linestyle
colors/lists of colors
colornorm/colormap
tick formatters

A living version of this document can be found on our
Discourse.