Minor re-writing of history and moving from master to main for default branch

It was noted on gitter that [ENH]: data kwarg support for mplot3d #20912 by jayjoshi112711 · Pull Request #20951 · matplotlib/matplotlib · GitHub included a commit that contained most of a virtual environment. This was corrected in the following two commits, but the commits were not squashed, so the files (~40MB worth) are still in the history, including several compiled files (hat tip to @anntzer.lee for noticing this). On the down side we have 250+ commits on master, but on the bright side these commits came in after we branched 3.5.x so we do not have any tags that include the bad commits.

Given the number of commits, I think it is too late to just force-push the commits out of existence (as we did in Removed commits from master branch). However, I propose that we look at this as an opportunity to rename our default branch from master → main. My proposal is:

  1. we use bfg (or an equivalent git filter, but bfg is simpler to use) to remove the files we do not want. Fortunately the file / folder names are sufficiently unique in our history that we can trivially remove them
  2. we push the cleaned branch to github as main and switch the default branch to be main. This should (if I understand the GH tools correctly) re-target all open PRs. Anything opened before #20951 should “just work” as they were never aware of those commits. Anything that was opened (and not merged) in between will need to be rebased / cherry-picked to remove the extra commits (as they will suddenly show they have ~250 additional commits due to the re-writing)
  3. we remove the master branch from GitHub

There will need to be some coordination to make sure that between steps 1 and 3 no one merges anything to master, but our merge rate is low enough, and we can check our work well enough, to make sure that it in fact did not happen (and fix it if it did).

The added work to move from master → main is

  • find-and-replace across the code base to update both the docs and any place where the branch name is hard-coded into CI etc
  • document how to checkout and create a main branch for users who already have a clone
  • document how to fix up a feature branch that forked after the PR was merged

From some quick experimentation just rebasing a branch is not enough (as git will very cleverly just preserve the commits we do not want!). There are (as of now) 28 open PRs (Pull requests · matplotlib/matplotlib · GitHub) that were created after those commits went into master (though some of them were probably branched before the problematic commits), so I think the options are:

  1. document how to use cherry-picking or interactive rebasing to get rid of the commits (I think git rebase main; git rebase -i would do the trick or git cherry-pick SOME...RANGE)
  2. ask everyone to run bfg and force push
  3. One of us (me) does 2 and force-pushes on behalf of everyone with an open PR

I suspect we should document all 3 and ask people which they want to use for their PR (from eyeballing it 2/3 of the PRs are from core developers).


The invocation to clean the repo is (following Removing sensitive data from a repository - GitHub Docs; see links there for install instructions):

bfg --delete-folders '{share,bin,python3.9}' --delete-files pyvenv.cfg

I have run this and am pushing the results to GitHub - tacaswell/matplotlib at main

It is my understanding that this operation is deterministic so anyone should be able to re-run this and verify my work.

I suspect that there might be a way to fully drop those three commits. I’m happy to just make them empty, but if someone wants to sort out how to drop them and advocate for that I would not be opposed.


Commits to be effectively removed (links may break once we drop the master branch and GH cleans their history)

After running bfg we get (note these link to my fork)

which are notably all empty (showing that this worked)

This all seems reasonable, though I’m not quite sure why you want to do the rewrites and change to main at the same time; they don’t seem related.

I don’t think we have that many PRs by “new” contributors since this happened, so I think the pain will be pretty minimal. git rebase -i seems the easiest option to me since we know the commits to remove. However, from the above I’m not quite sure if that is the same as using bfg which seems to be making new empty commits, whereas rebase -i and removing the offending commits will remove the commits altogether with no replacement?

If you’re rewriting, which breaks all commit references after the rewrite, why keep the empty commits? This can be done with filter-branch, and I think it’s easier than using bfg.

$ git checkout -b main master
# Skip the three commits that add then remove files by changing parents
# to skip them.
# This only works because the unwanted files are immediately removed.
# Otherwise we'd have to use a much more complicated filter or bfg.
$ git replace --graft fa982f03eb4c1d01f229a48a40ecf98ad4a7ac8e 47077d108a19682af9cf989d2d992b396e32c5b8
# Make the graft permanent.
$ git filter-branch 47077d108a19682af9cf989d2d992b396e32c5b8..main
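
For anyone who wants to try the graft approach before touching a real clone, here is a minimal sketch on a throwaway repository. Everything in it (the temporary directory, lib.py, the venv folder, the commit messages) is made up for illustration; only git replace --graft and git filter-branch correspond to the commands above. It sets FILTER_BRANCH_SQUELCH_WARNING purely to skip the warning that newer versions of git print.

```shell
#!/bin/sh
# Toy demo of the graft approach. All paths and names here are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning newer gits print

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master  # name the first branch "master"

echo code > lib.py
git add lib.py
git commit -q -m "good commit"
good=$(git rev-parse HEAD)

# The accident: a virtual environment gets committed, then removed again.
mkdir venv
echo junk > venv/pyvenv.cfg
git add venv
git commit -q -m "accidentally add venv"
git rm -q -r venv
git commit -q -m "remove venv"

echo more >> lib.py
git commit -q -a -m "first commit after the mess"
after=$(git rev-parse HEAD)

# Same recipe as above: re-parent the first clean commit onto the last
# good one, then make the graft permanent on the new branch.
git checkout -q -b main master
git replace --graft "$after" "$good"
git filter-branch "$good"..main >/dev/null 2>&1

count=$(git rev-list --count main)                  # commits left on main
venv_commits=$(git rev-list --count main -- venv)   # commits touching venv/
echo "main has $count commits, $venv_commits of them touch venv/"
```

The tip of main should have the same tree as the tip of master, which is the property that makes skipping the commits safe. Note that the replace ref stays in the local repository afterwards, so commands that inspect master will also see the grafted history unless run with git --no-replace-objects.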

I have become convinced that @QuLogic 's suggestion is best. Users then have (at least) 2 options for how to do the rebase. In all cases they need to do:

git checkout --track origin/main

option 1

# for each active branch
git checkout your_branch
git rebase --onto=main master  # assuming you still have master
git push --force-with-lease

option 2:

# just once, this hides the commits we want to exclude for your local git
git replace --graft fa982f03eb4c1d01f229a48a40ecf98ad4a7ac8e 47077d108a19682af9cf989d2d992b396e32c5b8
# for each branch
git checkout your_branch
git rebase main
git push --force-with-lease

Other options include cherry-picking a range of commits and playing games with soft-resets and re-committing.
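
To see what option 1 does to a feature branch, here is a small sketch on a throwaway repository. The repository layout, file names, and the branch name your_branch are placeholders, and the rewritten main is faked with a cherry-pick rather than produced by the real clean-up; only the git rebase --onto step mirrors the instructions above.

```shell
#!/bin/sh
# Toy demo of "git rebase --onto main master". All names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master  # name the first branch "master"

# Old history: good commit, add venv, remove venv, later work.
echo code > lib.py
git add lib.py
git commit -q -m "good commit"
mkdir venv
echo junk > venv/pyvenv.cfg
git add venv
git commit -q -m "accidentally add venv"
git rm -q -r venv
git commit -q -m "remove venv"
echo more >> lib.py
git commit -q -a -m "later work"

# Stand-in for the rewritten default branch: the same work without
# the two venv commits.
git checkout -q -b main master~3
git cherry-pick master >/dev/null 2>&1

# A feature branch that forked from the old master.
git checkout -q -b your_branch master
echo feature > feature.py
git add feature.py
git commit -q -m "feature work"

# Replay only the commits in master..your_branch on top of main.
git rebase --onto main master >/dev/null 2>&1

branch_commits=$(git rev-list --count your_branch)
venv_commits=$(git rev-list --count your_branch -- venv)
echo "your_branch has $branch_commits commits, $venv_commits touch venv/"
```

Because master is given as the upstream, only the commits that are on your_branch but not on the old master get replayed, so the venv commits never come along.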

though I’m not quite sure why you want to do the rewrites and change to main at the same time; they don’t seem related.

So we do not have to force-push to master. This will avoid a whole class of issues where people do git checkout master && git pull. If we re-wrote history on the master branch many people would end up in the situation where they have merged “new” master into “old” master. While the fix for this is straightforward (git reset --hard origin/master) it is not part of most people’s workflow (personally I almost never use it, but rely on pull + ff-only on merge so that if I accidentally put something on the default branch I get errors rather than extraneous merge commits or blowing away work I might have wanted to keep) and I think that this would be an issue that would linger for a while beyond the currently open PRs.
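
As a rough illustration of the ff-only behaviour described above, the sketch below builds two diverged branches in a throwaway repository and shows that git merge --ff-only refuses to combine them. The branch name rewritten and the file names are made up; rewritten stands in for a force-pushed upstream branch. The same refusal can be requested for pulls with git config pull.ff only.

```shell
#!/bin/sh
# Toy demo: --ff-only refuses to merge diverged history. Names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master  # name the first branch "master"

echo one > file.txt
git add file.txt
git commit -q -m "shared commit"
echo two >> file.txt
git commit -q -a -m "old history"

# Stand-in for a rewritten upstream: same base, different second commit.
git checkout -q -b rewritten master~1
echo 2 >> file.txt
git commit -q -a -m "rewritten history"

git checkout -q master
before=$(git rev-parse HEAD)
if git merge --ff-only rewritten >/dev/null 2>&1; then
    outcome=merged
else
    outcome=refused   # an error instead of an extraneous merge commit
fi
after=$(git rev-parse HEAD)
echo "merge was $outcome"
```

The local branch is left exactly where it was, so nothing is lost and the user has to decide explicitly how to reconcile the two histories.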


The git docs on recovering from an upstream rebase are pretty good: Git - git-rebase Documentation covers what we are doing

One other note, we will need to do this in two steps:

  1. rename master → main on GH to get the automatic PR migration etc.
  2. force-push the fixed main to GH

We also noted on the call that

git rebase -i main

will give the user 3 extra commits + their commits as another option to manage the re-base.

A proposed template for a message to the affected users:


This PR is affected by a re-writing of our history to remove a large number of accidentally committed files [see discourse](https://discourse.matplotlib.org/t/minor-re-writing-of-history-and-moving-from-master-to-main-for-default-branch/22354/6) for details.

To recover this PR it will need to be rebased onto the new default branch (main).  There are several ways to accomplish this, but we recommend (assuming that you call the matplotlib/matplotlib remote `"upstream"`):

```bash
git remote update
git checkout main
git merge --ff-only upstream/main
git checkout YOUR_BRANCH
git rebase --onto=main upstream/old_master
# git rebase -i main # if you prefer
git push --force-with-lease   # assuming you are tracking your branch
```

If you do not feel comfortable doing this or need any help please reach out to any of the Matplotlib developers.  We can either help you with the process or do it for you.

Thank you for your contributions to Matplotlib and sorry for the inconvenience. 

Typo: any help, please

edited the previous post to fix.

For confirmation, I ran the commands I listed on current master:

commit f6e0ee49c598f59c6e6cf4eefe473e4dc634a58a (origin/master, origin/HEAD, master)
Merge: 9fb5370dc2 f96c73c1ab
Author: David Stansby <dstansby@gmail.com>
Date:   Wed Oct 20 11:18:25 2021 +0100

    Merge pull request #21389 from jatin837/unused-pxl-coord
    
    Log pixel coordinates in event_handling coords_demo example on terminal/console

which now looks like this:

commit a1eef38f6f5a8acccc49f3b54ac429b04d8af15c (HEAD -> main)
Merge: df12d8caae 8955d02c56
Author: David Stansby <dstansby@gmail.com>
Date:   Wed Oct 20 11:18:25 2021 +0100

    Merge pull request #21389 from jatin837/unused-pxl-coord
    
    Log pixel coordinates in event_handling coords_demo example on terminal/console

15:54:38 $ git push DANGER --force-with-lease main:main
Enumerating objects: 5941, done.
Counting objects: 100% (3462/3462), done.
Delta compression using up to 8 threads
Compressing objects: 100% (1300/1300), done.
Writing objects: 100% (2876/2876), 766.75 KiB | 21.30 MiB/s, done.
Total 2876 (delta 1983), reused 1962 (delta 1564), pack-reused 0
remote: Resolving deltas: 100% (1983/1983), completed with 296 local objects.
To github.com:matplotlib/matplotlib.git
 + f6e0ee49c5...a1eef38f6f main -> main (forced update)

In addition to renaming the default branch we have temporarily put the old master branch back on github as old_master

16:07:29 $ git push DANGER old_default:old_master
Enumerating objects: 7761, done.
Counting objects: 100% (5273/5273), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2967/2967), done.
Writing objects: 100% (4696/4696), 19.01 MiB | 2.25 MiB/s, done.
Total 4696 (delta 2099), reused 4242 (delta 1674), pack-reused 0
remote: Resolving deltas: 100% (2099/2099), completed with 295 local objects.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File 63686374be9f8e2524004641a65752fae32b4c7d is 62.26 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: 
remote: Create a pull request for 'old_master' on GitHub by visiting:
remote:      https://github.com/matplotlib/matplotlib/pull/new/old_master
remote: 
To github.com:matplotlib/matplotlib.git
 * [new branch]            old_default -> old_master


Rebase worked for me. Only issues with the instructions:

  • my “main” is named “placeholder”.
  • I had to do git push --force-with-lease origin YOUR_BRANCH

EDIT: I guess I’m not sure what “if you are tracking your branch” means.

Thanks for working on this, I’m sure it’s nerve-racking!

In addition to all of the PRs that were correctly migrated over, the following PRs were closed:

The forks that these PRs came from have been removed so GH could not automatically move them. It is still possible to recover the code in these PRs by checking out the commits via PR references that GH provides if someone is interested. Elliott has marked them all as orphaned.

At this point @QuLogic and I believe that the primary work is done (with all of the rebasing left to do).

Please report any other issues you encounter!

Hi team,

I’m addressing this in relation to PR #21338 and having some trouble. I think my workflow is the usual github one, so I’m raising it here in case it’s a general problem. To make this PR I:

  • Forked matplotlib upstream to my own profile (i.e. origin)
  • Pulled to local machine, and created a new feature branch, axeslabels
  • committed axeslabels to origin
  • opened PR from my fork’s axeslabels branch to matplotlib branch (now main)

Following the instructions on the local machine, git checkout main doesn’t work as there is no main in my remote. Trying git fetch upstream and then git merge --ff-only upstream/main doesn’t work either: fatal: Not possible to fast-forward, aborting. But git rebase upstream/main succeeds, so I thought that doing git checkout axeslabels and then git rebase --onto=main upstream/old_master would work, but it yields fatal: Does not point to a valid commit 'main'.

So I’m a bit stumped now!

Does

git checkout -t upstream/main

work to create a local main?

It may also be worth trying:

git rebase --onto=upstream/main upstream/old_master

which does not rely on your (other) local branches.

Fantastic @tacaswell , thank you. If you don’t mind just confirming, but this sequence seems to have worked and has now updated the PR from the fork:

git checkout -t upstream/main
git checkout MY_FEATURE_BRANCH
git rebase --onto=main upstream/old_master
git push --force-with-lease origin MY_FEATURE_BRANCH

Thanks for the help!

I have now deleted the branch old_master on the main repo. The last commit it was at was f6e0ee49c5.