Improving feature discoverability by centralizing documentation of "implicit" types

Abstract

Matplotlib is the de facto standard library atop which the bulk of the Python visualization ecosystem is built, including such popular packages as cartopy, pandas, seaborn, and many others.

Over the years, Matplotlib has evolved to contain a stunning array of features; however, discoverability remains a central problem for our documentation.
There is a vast existing store of matplotlib documentation that can be broken
up largely into tutorials, API documentation, and examples (ignoring “meta” documentation, e.g. installation how-to’s, contribution guides, etc.).
However, while anything not covered in detail with the API documentation is likely to have a tutorial or example written about it, it seems that even amongst core developers the best way to figure out the “latest and greatest” way to accomplish a task is often to ask the community.

In order to remedy this problem, I propose to integrate the API documentation with a large swath of existing tutorials and examples by attempting to systematically document what I will call Matplotlib’s “implicit types”.
These are a set of core concepts which, due to their “duck type”-heavy nature, have not yet merited their own proper class, but nonetheless are commonly accepted as inputs to a variety of Matplotlib routines internally, and whose documentation is currently scattered throughout docstrings, tutorials, and sometimes even just in examples.

The goal is to create a path to discoverability of Matplotlib’s core features for one of our largest user groups: the so-called “copy/paste/edit” developer (see the User research section below).
We will leverage our existing Sphinx/Readthedocs infrastructure by simply creating a new reStructuredText role :type: which can be used in place of the usual numpydoc type syntax when appropriate.
For example, in Line2D:

"""
color : :type:`Color`, default: :rc:`lines.color`
"""

would link to a comprehensive “Color” tutorial (currently links to
Line2D.set_color).
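
As a rough illustration of what such a role amounts to, the sketch below resolves :type: references in a docstring to one central page per implicit type. It is a stdlib-only stand-in and not an actual Sphinx extension; the TYPE_PAGES mapping, the page paths in it, and the function name are all hypothetical.

```python
import re

# Hypothetical mapping from implicit-type name to its central docs page.
TYPE_PAGES = {
    "Color": "tutorials/colors/colors.html",
    "Linestyle": "gallery/lines_bars_and_markers/linestyles.html",
}

# Matches the proposed role syntax, e.g. :type:`Color`
TYPE_ROLE = re.compile(r":type:`(?P<name>[^`]+)`")


def resolve_type_roles(docstring):
    """Replace each known :type:`Name` with "Name <page>"; leave unknown names untouched."""

    def _link(match):
        name = match.group("name")
        page = TYPE_PAGES.get(name)
        return match.group(0) if page is None else f"{name} <{page}>"

    return TYPE_ROLE.sub(_link, docstring)


doc = "color : :type:`Color`, default: :rc:`lines.color`"
resolved = resolve_type_roles(doc)
print(resolved)
```

A real implementation would presumably register the role with Sphinx so that the link is resolved at build time; the point here is only that every docstring mentioning the same implicit type ends up at the same page.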

The goals of this change are to

  1. Make it easy to find the “common”/“easiest” option for a parameter (preferably in zero clicks).
  2. Make it easy to “scroll down” to see more advanced options (preferably with one click, max).
  3. Provide a centralized strategy for linking top-level “API” docs to the relevant “tutorials”.
  4. Avoid API-doc-explosion, where scanning through the many possible options to each parameter makes individual docstrings unwieldy.

User research

Because matplotlib users are such a diverse group, creating documentation that
works for everyone—from those for whom matplotlib is their introduction to
Python to seasoned data scientists and library writers hoping to wrap
matplotlib into their own packages—is a herculean task. In order to identify
what unmet needs are most relevant for different kinds of users, I watched
several matplotlib users undertake assigned tasks using the library, where the
users were split into three different categories of experience with matplotlib
and Python:

  1. Experienced with matplotlib and Python
  2. Experienced with Python, but not plotting or matplotlib specifically
  3. Experienced with design, but with neither Python nor matplotlib

TODO: Fill in this section with full write-up.

The main takeaway was that both types 1 and 2 are “copy/paste/edit” developers
who do the majority of their editing with the docs pulled up, and only discover
features by googling in plain English when they have a specific task they want
to accomplish that they don’t immediately find in the API docs (or in the docs
for the example they are copying from).

Detailed proposal

Historically, matplotlib’s API has relied heavily on string-as-enum
“implicit types”. Besides mimicking MATLAB’s API, these parameter-strings allow the
user to pass semantically-rich values as arguments to matplotlib functions
without having to explicitly import or verbosely prefix an actual enum value
just to pass basic plot options (i.e. plt.plot(x, y, linestyle='solid') is
easier to type and less redundant than plt.plot(x, y, linestyle=mpl.LineStyle.solid)).
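
To make the trade-off concrete, here is a minimal sketch of a string-backed enum that accepts both spellings. The LineStyle class and the normalize_linestyle helper are hypothetical and exist only for illustration; they are not part of Matplotlib's API.

```python
from enum import Enum


class LineStyle(str, Enum):
    """Hypothetical enum; the plain strings remain valid inputs."""

    solid = "solid"
    dashed = "dashed"
    dotted = "dotted"


def normalize_linestyle(value):
    """Accept either the plain string or the enum member and return the member."""
    return LineStyle(value)  # raises ValueError for unknown names


# Both spellings resolve to the same member, so the short form loses nothing.
assert normalize_linestyle("solid") is LineStyle.solid
assert normalize_linestyle(LineStyle.solid) is LineStyle.solid
```

Because the members are also str instances, code that compares against the bare string (LineStyle.solid == "solid") would keep working under this assumption.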

Many of these string-as-enum implicit types have since evolved more sophisticated
features. For example, a linestyle can now be either a string or a 2-tuple
of an offset and an on-off dash sequence, and a MarkerStyle can now be either a string or a path. While this
is true of many implicit types, MarkerStyle is the only one (to my knowledge) that
has the status of being a proper Python type.

Because these implicit types are not classes in their own right, Matplotlib has
historically had to roll its own solutions for centralizing documentation and
validation of these implicit types (e.g. the docstring.interpd.update docstring
interpolation pattern and the cbook._check_in_list validator pattern,
respectively) instead of using the standard toolchains.
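
For readers unfamiliar with the validator pattern, the following is a simplified stand-in written for this post. It is not Matplotlib's actual cbook._check_in_list code, and the function names and error wording are assumptions.

```python
def check_in_list(allowed, **kwargs):
    """Raise ValueError if any keyword value is not one of *allowed*.

    A simplified stand-in for the validator pattern, not Matplotlib's code.
    """
    for name, value in kwargs.items():
        if value not in allowed:
            options = ", ".join(repr(a) for a in allowed)
            raise ValueError(
                f"{value!r} is not a valid value for {name}; "
                f"supported values are {options}"
            )


def set_capstyle(capstyle):
    # Every function that takes a capstyle has to repeat this check,
    # and its docstring has to repeat the list of allowed values.
    check_in_list(["butt", "round", "projecting"], capstyle=capstyle)
    return capstyle
```

The validation itself is short; the cost is that the list of allowed values lives at each call site rather than in one documented place.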

While these solutions have worked well for us, the lack of an explicit location
to document each implicit type means that the documentation is often difficult to
find, large tables of allowed values are repeated throughout the documentation,
and often an explicit statement of the scope of an implicit type is completely
missing from the docs. Take the plt.plot docs, for example. In the “Notes”,
a description of the MATLAB-like format-string styling method mentions
linestyle, color, and markers options. There are many more ways to
pass these three values than are hinted at, but, for many users, this is their
only source of understanding about what values are possible for those options
until they stumble on one of the relevant tutorials. In the table of Line2D
attributes, the linestyle entry does a good job of linking to
Line2D.set_linestyle where those options are described, but the color
and markers entries do not. color simply links to Line2D.set_color,
which does nothing in the way of offering intuition on what kinds of inputs are
allowed.

… It can be argued that plt.plot is a good candidate to be explicitly
exempted from any documentation best practices we try to codify, and I’ve
chosen it intentionally to elicit the strongest opinions from everyone.

It could be argued that this is something that can be fixed by simply tidying
up the individual docstrings that are causing problems, but the issue is
unfortunately much more systemic than that. Without a centralized place to find
the documentation, this will simply lead to us having more and more copies of
increasingly verbose documentation repeated everywhere each of these implicit
types is used. The alternative, of scattering the information throughout the
documentation, will instead lead to the users having to slowly piece together
their mental model of each implicit type through wiki-diving style traversal
throughout our documentation, or piecemeal from StackOverflow examples.

Ideally, a mention of linestyle in the LineCollection docs should
instead link to the same place as it does in the plt.plot docs. By
organizing these linestyle-specific docs in order from most-common to
most-complex input types, we can maintain a “single-click-to-discover” property
for our advanced plotting options, while also making sure that we don’t hurt
usability for users that simply want to know the simplest way to accomplish a
common task.

Practically speaking, the actual information that we want to have in the
LineCollection docs is just:

  1. A link to complete docs for allowable inputs (like those found in
    Line2D.set_linestyle).
  2. A plain-words description of what the parameter is meant to accomplish. To
    matplotlib power users, this is evident from the parameter’s name, but for
    new users this need not be the case (e.g. linestyle: a description of whether the stroke used to draw each line in the collection is dashed, dotted or solid).
  3. A link to any tutorials that visually depict the possible options (currently
    found only after already clicking through to the Line2D.set_linestyle
    docs).

In order to make this information available for all implicit types, helping the
continued improvement of the consistency and readability of the docs, we propose
the following best-practices for handling implicit types:

  1. Implicit type documentation should be centralized at a dedicated page, where
    the easiest/most common/simplest options are plainly documented in a
    separate section on the top of the page, and more advanced options can be
    found by simply scrolling down.
  2. Functions that accept implicit types as parameters should link to the
    appropriate :type: docs.
  3. If an implicit type is a “string-as-enum”, it should simply be made an
    Enum, and each possible value should have a Sphinx-parseable documentation
    string.

In particular, notice that (1) would replace large copies of tables of possible
linestyles, markerstyles, etc, with links to the complete documentation for
each. Without all the visual noise from these tables of valid options, the
relevant functions would be free to visibly link to tutorials where these
options are visually demonstrated.
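
One way best practice (3) could look is sketched below: each enum value carries its own description, and the central list of options is generated from the members instead of being copied into every docstring. The JoinStyle class, its description attribute, and the options_table helper are hypothetical, and the member descriptions are illustrative wording only.

```python
from enum import Enum


class JoinStyle(str, Enum):
    """Hypothetical enum for how two line segments are joined."""

    def __new__(cls, value, description):
        member = str.__new__(cls, value)
        member._value_ = value  # keep lookup by plain string working
        member.description = description
        return member

    miter = ("miter", "Extend the outer edges until they meet in a sharp corner.")
    round = ("round", "Join the segments with a circular arc.")
    bevel = ("bevel", "Cut the corner off with a straight edge.")


def options_table(enum_cls):
    """Build the one central list of allowed values from the member descriptions."""
    return "\n".join(f"{m.value!r}: {m.description}" for m in enum_cls)


print(options_table(JoinStyle))
```

Under this assumption, adding a new allowed value means editing one class, and every page that links to the type picks up the change.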

In the actual docs, an entry such as

"""
linestyles : :type:`Linestyle` or list thereof, default: :rc:`lines.linestyle`
"""

would link to a comprehensive explanation of what “linestyles” are allowable,
similar to what is currently found at
:doc:`/gallery/lines_bars_and_markers/linestyles.html` (the parameter currently does not link to anything at all!).

Some benefits of this approach include:

  1. Less likely for docs to become stale, due to centralization.
  2. Increased discoverability of advanced options. If the simple linestyle option
    '-' is documented alongside more complex on-off dash specifications,
    users are more likely to scroll down than they are to stumble across an
    unlinked-to tutorial that describes a feature they need.
  3. Canonicalization of many of matplotlib’s “implicit standards” (like what is a
    “bounds” versus an “extents”) that currently have to be learned by reading
    the code.
  4. The process would likely highlight issues with API consistency in a way that
    could be more easily tracked via Issues, helping with the process of
    improving our API (see below for discussion).
  5. Becoming more compatible with potentially adding typing to the library.
  6. Faster doc build times, due to significant decreases in the amount of
    text needing to be parsed.

Implementation

This proposal would create one centralized “tutorial” page per implicit type.
For types with complex construction requirements, we would produce and use
classmethods for explicit construction from a known type, but __init__
would continue to hold the logic required to deduce how to construct the type
from the type of the input.

All functions that accept this implicit type as a parameter would have their
docstrings changed to simply use the numpydoc “input type” syntax to link to
this new class. All functions which use this implicit type (i.e. would raise on
an invalid input) would construct an explicit object instance using the general
__init__, allowing the new class to handle validation.
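
The construction pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the LineStyle class, its method names, and the dash patterns in _NAMED are hypothetical and do not reflect existing Matplotlib code.

```python
class LineStyle:
    """Hypothetical explicit class for the linestyle implicit type."""

    # Illustrative dash patterns only; not Matplotlib's actual defaults.
    _NAMED = {"solid": (), "dashed": (4.0, 2.0), "dotted": (1.0, 2.0)}

    def __init__(self, spec):
        # General entry point: deduce how to construct from the input's type.
        if isinstance(spec, LineStyle):
            offset, dashes = spec.offset, spec.dashes
        elif isinstance(spec, str):
            offset, dashes = 0.0, self._lookup(spec)
        else:
            offset, dashes = spec
        self._validate_and_store(offset, dashes)

    @classmethod
    def from_name(cls, name):
        """Explicit construction when the input is known to be a name."""
        return cls.from_dashes(0.0, cls._lookup(name))

    @classmethod
    def from_dashes(cls, offset, dashes):
        """Explicit construction from an offset and an on-off sequence."""
        self = cls.__new__(cls)
        self._validate_and_store(offset, dashes)
        return self

    @classmethod
    def _lookup(cls, name):
        try:
            return cls._NAMED[name]
        except KeyError:
            raise ValueError(f"{name!r} is not a valid linestyle name") from None

    def _validate_and_store(self, offset, dashes):
        dashes = tuple(float(d) for d in dashes)
        if len(dashes) % 2 or any(d < 0 for d in dashes):
            raise ValueError(
                "dashes must be an even-length sequence of non-negative numbers"
            )
        self.offset, self.dashes = float(offset), dashes


def set_linestyle(ls):
    # A consuming function delegates all validation to the class.
    return LineStyle(ls)
```

With this layout, a function that accepts a linestyle needs no validation logic of its own, and its docstring only needs to name the type.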

The implicit types that I propose require better organized tutorials are:

  1. capstyle
  2. joinstyle
  3. bounds
  4. extents
  5. linestyle
  6. colors
  7. colornorm/colormap
  8. ticks
  9. Probably others…

Related issues

Some common discoverability issues that this proposal does not address involve
parameters whose allowable types depend on other parameters (for example,
x and y in plt.plot depending on data).

Alternatives

I submitted a similar proposal as MEP30, which instead of simply adding
documentation, actually proposes to make each of these concepts into a new
style class, so that much of what is currently tutorial documentation
effectively becomes just API documentation, which can be linked using the
standard numpydoc conventions for types of parameters.

Timeline

//TODO


This proposal was originally submitted as MEP30, but @tacaswell and @story645 suggested I submit it as a GSOD proposal here.

I created this now so that I can start to get feedback, but will continue to edit the original post slightly as I finish writing up my user research and incorporating people’s suggestions for how to do MEP30 better. My hope is that after incorporating feedback, the original post here will be a fully-fledged GSOD proposal, ready to submit.

I am happy to mentor anybody for whom this proposal sounds interesting, but otherwise I will be doing this work anyway, so it would be great to get paid for it :wink:

For writing samples please see any of the papers I have published in my CV:
http://brunobeltran/files/brunocv.pdf

In particular:

Or any of my major PRs to matplotlib, including:

  1. everything linked to from #16891
  2. my take on the Collections docs (#17575), which inspired this proposal

@brunobeltran I think this is a strong proposal; please make sure to finish submitting it to Google before the deadline.

I have some slight concerns about the exact proposed spelling of the :type: sphinx role and that the documentation work will get side-tracked fixing the inevitable mismatches we are going to find.

Thanks @tacaswell, I’ll include some language about limiting the scope of the proposal to actually defining these matplotlib “concepts”, since while writing up the documentation (based on reading through the code so far) it will (usually) be easy enough to just open an issue whenever I find a mismatch, instead of trying to fix it completely.

I am definitely still open to suggestions RE: the spelling of the :type: Sphinx role (after giving it more thought, maybe having it be called :concept: would make more sense?) I’ll bring it up in today’s call.

I’ll be filling in the user research sections today and should have plenty of time to submit by the deadline Wed!