How to engage more contributors to matplotlib?

SidharthBansal · April 7, 2020, 7:12pm

@story645 asked my opinion on how to increase the long term contributors count at mpl at gitter. Thanks for the question.
These are a few things which can help mpl:

Welcomebot: Use a welcome bot to encourage people on their first PR submission, first issue submission, first PR merge, etc. The welcome bot has extra friendly messages to encourage contributors to work with organisations. Emoticons can also be incorporated for more friendliness.
Detailed and Quick Installation Guidelines: Newcomers face a lot of problems scrolling between pages for installation. So, my suggestion would be a single page of guidelines with the installation commands for sphinx, matplotlib, conda/venv, etc. Preferably it would give a troubleshooting link as well. I know there are problems with different OSes etc., but that can be discussed.
More and self-contained first-timer issues: Most of the first-timer issues miss code links, and newcomers don’t know how to search for specific features. So, including them will help contributors to commit. Issues should contain proposed solutions too. Currently most issues are already taken up or fixed.
Prefer friendliness over professionalism initially to attract contributors: Imagine a person who is new to the organisation and has no prior experience in the tools used at the organisation. It is exceedingly difficult to contribute. Usage of technical jargon makes their life much more problematic. So, motivating folks helps.

Thanks

SidharthBansal · April 7, 2020, 7:21pm

@story645 wrote on How to engage more contributors to matplotlib · Issue #17060 · matplotlib/matplotlib · GitHub

Thanks for posting it in one place, but like I mentioned on gitter, I think this conversation would be better suited for Community - Matplotlib
Also can you please give constructive suggestions that we can spin off into PRs? Do we need to update documentation, reviewer guidelines, the readme? We’re happy to adopt these suggestions but help on how would be appreciated.

SidharthBansal · April 7, 2020, 7:23pm

Also can you please give constructive suggestions that we can spin off into PRs? Do we need to update documentation, reviewer guidelines, the readme?

These things are working smoothly. I checked 50+ PRs to understand the mpl workflow and to find places related to the gsoc proposal/testing, to the work I am doing, or to things I can do easily as a newcomer here. Most of the PRs are reviewed. Contributors left midway, so most PRs are unfinished. Hence, the PR stats are huge. MPL may adopt a strategy of closing PRs which have been open for more than 3 months with no active contributor, with a message like “We are happy to have your help here. We encourage you to contribute and complete this PR after reopening it in your free time if possible. We are closing this PR now as it has been 3 months. -MPL Team”, etc. We can use the saved replies feature.
3 months is an example. It can be changed according to the needs of mpl.

Secondly, we can use a saved reply saying “Are you stuck somewhere? Do you need any help? We will be happy to help you. Thanks for working with us -MPL Team”, etc.
Saved replies save a ton of reviewers’ time.
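
The triage idea above could be sketched roughly as follows. This is only an illustration, not an existing mpl tool: the `PullRequest` class, the `triage` function, and the 30-day nudge threshold are all assumptions made for the example, and the 90-day value stands in for the "3 months" mentioned above. The reply texts are adapted from the wording suggested in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Saved-reply texts adapted from the suggestions above (wording is illustrative).
NUDGE_REPLY = (
    "Are you stuck somewhere? Do you need any help? "
    "We will be happy to help you. Thanks for working with us -MPL Team"
)
CLOSE_REPLY = (
    "We are happy to have your help here. We encourage you to contribute and "
    "complete this PR after reopening it in your free time if possible. "
    "We are closing this PR now as it has been 3 months. -MPL Team"
)


@dataclass
class PullRequest:
    """Hypothetical stand-in for a PR record; not a GitHub API object."""
    number: int
    last_activity: datetime  # last comment or commit by the contributor


def triage(prs, now,
           nudge_after=timedelta(days=30),   # assumed value, not from the thread
           close_after=timedelta(days=90)):  # roughly the "3 months" example
    """Map PR number -> (action, saved reply) for PRs idle past a threshold."""
    actions = {}
    for pr in prs:
        idle = now - pr.last_activity
        if idle >= close_after:
            actions[pr.number] = ("close", CLOSE_REPLY)
        elif idle >= nudge_after:
            actions[pr.number] = ("nudge", NUDGE_REPLY)
    return actions
```

Both thresholds are parameters, so the core team could tune them to whatever period it settles on.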

Thanks for asking my views, @story645.

timhoffm · April 7, 2020, 11:57pm

@SidharthBansal thanks for sharing your ideas!

Welcomebot and Detailed and Quick Installation Guidelines are definitely something we can improve on. (You’re also very welcome to contribute here if you’re interested.)

I’m afraid, More and self-contained first-timer issue is not that easy.

Note: This is my personal view and does not reflect any official project guidelines:

Since Matplotlib is quite old and settled, many low hanging fruits are picked.
Due to our wide use, we need to be extra careful to not break user code, but also to not burden the codebase with features we do not want to maintain in the long run. This makes adding new stuff much harder.
Additionally, I personally see a resource limitation and conflict of interest. To write a self-contained description, I usually have to think this through to an extent to which it would be faster for me to write the patch than to write up what needs to be done. And then you have the additional time to review a first-timer PR. Making more self-contained issues is an investment of valuable core developer time.

I’ve mentioned in an earlier discussion that I’m more concerned with the resource limitation of experienced developer/reviewer time (e.g. we already now don’t manage to burn down the open PRs) rather than attracting new contributors. From the project perspective IMHO we currently can’t cope with many more new contributors. It sounds a bit selfish (as far as it can from someone spending a lot of his free time here) but the project usually does not benefit from people doing just one or two PRs. We’d like to attract people who want to stay for a while, and maybe even become core contributors. I may be wrong, but I assume that better first-timer issues would predominantly help the former people, while really interested (and interesting) people are willing to invest some more effort.

If we had unlimited resources, I would be all in for a better first-timer integration. But given resource limitations, I don’t see first-timer issues as a top priority.

SidharthBansal · April 8, 2020, 12:22am

First-timers are not a priority right now. Agreed! Unless MPL is running a hackathon or hiring drive, creating dozens of first-timer issues doesn’t make sense.

Also, I observed here that mpl needs highly qualified, proficient, long-term contributors who can stick around. MPL requires deep understanding and some breadth of knowledge too. So, aiming at beginners who don’t even know git is a waste of development resources.

I agree that self-contained issues take developers’ time, and developers can invest that time in PR reviews. I also agree that self-contained first-timer issues mostly bring in short-term contributors. One thing mpl can do is open a separate issue with a summary of the change/enhancement/feature needed for first-timers, instead of adding the good first timers tag at the end of long discussions. Consider X, who is highly qualified (has developed many projects, published research papers) but has never worked in open source. Reading long conversations as the first step will be difficult. Copying and pasting the essence of the discussion will hardly take a minute or two. We can omit the self-contained part, as I agree with you. We can use the help of first-timer bot too. It uses friendly templates to welcome newcomers. It is easy to integrate and use, and needs no maintenance.

Regarding welcomebot and installation guideline updates, I will be glad to help you all with that. I have many things to read for gsoc approaches discussion. Many of them are suggested in chat by the mpl team on gitter. Once completed with building concepts regarding mpl, I will implement both of them.

SidharthBansal · April 8, 2020, 12:36am

Can mpl close the PRs on which contributors have been inactive for more than 4 months (the time depends on the core team’s judgement)? This will lead to closing a lot of PRs. When we see that the PR tracker has too many PRs, we get worried on a daily basis. When we see it has fewer, we get less worried.

story645 · April 8, 2020, 4:09am

@QuLogic and @tacaswell are working on sorting out the PR backlog as part of the CZI grant.

When the topic gets messy enough that it’s no longer skimmable, it usually means that there are a lot of viewpoints to distill and summarize. On new PRs, we’re piloting a champion system where a core contributor who is not necessarily reviewing the PR is in charge of seeing a PR get merged or closing it because of how unwieldy the process can get.

SidharthBansal · April 8, 2020, 12:14pm

ok. I was referring to those issues which converge to a single solution but still have many discussed opinions (because they have been open for many years). I agree with the fact that issues with different viewpoints can’t be copied to a new issue. Thanks for the deeper thought on this, Hannah!

tanim.islam · April 9, 2020, 5:35am

I think the biggest barrier to engage more developers is matplotlib’s crappy architecture (let’s be honest, why are annotations, for instance, still so terrible?) and fairly stupid design. I don’t have a good idea on how to fix this, except with lots of money or lots of time. Maybe we will be lucky enough to fix matplotlib’s problems before something better comes along.

Ernest · April 9, 2020, 11:40pm

I guess it would be nice to know why you think annotations are terrible @tanim.islam. Is it the API? Or the appearance? Please feel free to open a new topic on that.

In general, I think one major problem is the following: Highly qualified people can come from two sides: a) people with a programming background - for those matplotlib is unattractive because it has grown organically over more than 10 years, which makes it hard to implement novel ideas and concepts, which are at the heart of what a programmer wants to do.
b) people with a scientific background - those mainly want to get their problem solved and are often motivated to stick to it, but then leave once that has converged (for the better or the worse).

Maybe it’s still interesting to see how contributions compare with other projects, for example vega-lite. One significant difference may be that matplotlib provides the full stack from the user-facing API to the rendering, which developers need to understand to a certain degree, while other projects are much more layered, such that they a) won’t get hit by the full stack of issues and b) can solve problems in a more isolated environment.
E.g. in the example case of vega-lite, the majority of computer science students will not open an issue at vega, but rather with altair as one of the python interfaces to it, making it easier for everyone tackling the issues in a concealed environment.

brunobeltran · April 10, 2020, 3:35am

To add a data point, my interest in contributing to matplotlib comes primarily from a slightly neurotic desire to make sure that the plotting code I write spits out exactly what I want, even if that’s significantly more inefficient than just solving my problems “in post”.

My impression is that the main places we could improve are:

making sure the PR checklist links directly to the code that needs to be run in order to check that each step is complete.
a) Instruct users directly to run pytest/flake8/pydocstyle. (especially pydocstyle, to avoid wasting maintainer time, as discussed in the last dev call).
b) Point users to the correct place for API change docs (see my suggestion below for an example new checklist):

## PR Checklist

- [ ] Has Pytest style unit tests (and `pytest lib/matplotlib/tests` passes!)
- [ ] Code is [Flake 8](http://flake8.pycqa.org/en/latest/) compliant (run `flake8` on changed files to check)
- [ ] New features are documented, with examples if plot related
- [ ] Documentation is sphinx and numpydoc compliant, and follows matplotlib style guidelines (run `pydocstyle` on changed files to check)
- [ ] Added an entry to doc/users/next_whats_new/ if major new feature (follow instructions in README.rst there)
- [ ] Documented in doc/api/api_changes_[VERSION] if API changed in a backward-incompatible way (follow instructions in README.rst there)
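
As a rough illustration of how the checks named in this checklist could be gathered in one place, here is a minimal sketch. The helper name `checklist_commands` is hypothetical, and restricting `flake8` and `pydocstyle` to the changed `.py` files is my own assumption about what "run on changed files" would mean in practice.

```python
def checklist_commands(changed_files):
    """Build the commands from the checklist as argument lists.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    # Only Python sources are relevant for the style checkers.
    py_files = [f for f in changed_files if f.endswith(".py")]

    # The test suite is always run, as in the first checklist item.
    commands = [["pytest", "lib/matplotlib/tests"]]
    if py_files:
        commands.append(["flake8", *py_files])
        commands.append(["pydocstyle", *py_files])
    return commands
```

Each returned list could be passed to `subprocess.run` as is, or joined with spaces and pasted into a terminal.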

Have an official way to assign “champions”.
I realize this may be currently unfeasible, as the majority of maintainers likely don’t have time to commit to following a bunch of newbies’ PRs, but I was lucky enough to have @anntzer come and show real interest in my PRs initially, and without him directly advocating for me in the Gitter, I doubt I would have ever gotten the opportunity to productively contribute all the code I had written to the mainline.
Getting the sense that I did have a “champion” early on was definitely invaluable, even just from a morale perspective, and knowing that it was “official” would have made things significantly less intimidating.
The process for making image tests is…difficult. I look forward to the results of the GSOC.
The developer-facing documentation could be greatly improved. Some of the issues I have, broadly speaking are, e.g.
- While the core developers obviously have a good mental picture of matplotlib's overall architecture (e.g. how backends get passed artists, the idea that collections are there to speed up the backend, etc), these architectural details are not written down anywhere, forcing new developers to just “read the code”. This works fine enough, since matplotlib is small enough as a library, but is probably a pain point for most new devs. If there was at least an incomplete page outlining the contents of the library as a whole (in readable form, and not just as a list of opaque module names), that would probably help?
- The existing internal-facing docs as a whole are designed to tell you “exactly what you need to know”, and not to teach you how to use the internals (via tutorials, or even just by being verbose when it would help). The terse language (especially when it leads to me needing to jump between different docstrings to deduce the functionality of a method) strongly discourages me from digging into the library.
- Concretely, the previous issue might be slightly alleviated if for any Object with an ObjectBase, we simply allowed Object to get a dedicated page in the API docs somewhere so that all the methods of Object (including those defined by ObjectBase) can be documented in one place. The module level docs in that case could focus on outlining the differences between the different subclasses of ObjectBase as succinctly as possible, instead of just listing all the methods. The dependency diagrams (such as in the transforms docs) are cute, but largely useless to a new developer who doesn’t understand what each node actually is and why it was architected that way in the first place. (Especially since my IDE already shows them to me as I’m coding).
- Internal facing docs often have small errors (easy to fix) and seeming API inconsistencies (hard to/won’t fix). A good example of the kind of errors that still exist is Path.get_extents returning a Bbox, but documenting that it returns extents (fixed in #16832). A good example of inconsistency is Bbox.union taking a list, and Bbox.intersection taking one argument per Bbox.
  I would love to improve these things, but because code-churn is so actively discouraged and I am a very new contributor, I would not feel comfortable just opening a PR (especially one that’s just full of doc rewrites) without someone telling me explicitly that I should start doing that kind of thing, and Issues opened about API consistency seem to be encouraged but largely ignored (e.g. #16747).

Some of the things that are already done very well:

people largely responded with very positive feedback, and any criticism I received always felt very constructive.
someone always poked their heads into my PRs within a day or so, increasing the feeling that the library was “accepting” contributions.
the developer docs felt like they did technically have everything I needed to know, although the PR checklist was actually more helpful/useful/readable.