splitting up a "big" git repo, take two

rjbs on 2007-09-11T01:37:13

Converting to git has been fun, educational, and annoying, at various times. Here are some notes (mostly to myself) on the fun, educational, and annoying bits that I solved today.

When I converted my personal Subversion repository (notes on my RPG, talks, letters, config files) to Git, I just turned one big Subversion repo into one big Git repo. It was about 400 megabytes, in Git.

I decided, a little later, that there was no reason that all my huge Keynote slides needed to be in the same place as my letters to my grandmother or my notes on my RPGs. A little digging made it seem like this would work just fine:

$ git filter-branch --subdirectory-filter letters HEAD

With that, my history would be rewritten to only include changes in the letters directory, which would become root. When I did it, I found that everything was still there -- but listed as a new or modified file. That was easy to deal with, too:

$ git reset --hard

Great! Now I had only my fifteen letters with twenty total revisions. Only one small problem:

$ du -sh ../letters
381M letters

$ git-gc && git-prune
$ du -sh ../letters
381M letters

I couldn't figure out what the hell was going on, mostly because I didn't know much about how git stored things. More importantly, I didn't realize that when cloning a git repository from one directory to another on the same filesystem, git-clone will make hardlinks. Then, when cleaning up, it will see that files are in use and not purge them. They're not actually taking up any more space, and after I had removed the original, big repository that I'd cloned, a cleanup would have worked.

I realized this later, when reading the docs because of another problem. I had long since removed the trimmed-down repositories, because I thought they were taking up too much space. To deal with the problem from the get-go, I did something like this:

for dir in talks/exporter rpg/deliverance letters;
  do (
    cd $dir
    git filter-branch --subdirectory-filter talks/$dir HEAD
    git reset --hard
    cd ..
    git clone --no-hardlinks $dir _$dir
    cd _$dir
    git gc --aggressive
    git prune
  )
done

The important thing here is the --no-hardlinks option to git clone. Obviously, it prevents hardlinking, instead making copies of everything. With that done, the gc and prune commands can work as I had expected, removing all the objects not used in this particular repo.

More and more, I find that git is a really well-designed system. I wish I had a reason to learn much more about it!


Similar problem, not as good a solution

david.romano on 2007-09-11T04:42:39

Thanks for the write-up: I was trying almost the same thing! I had imported a pretty big repository (made up of various subprojects) to git, which is sorta what you described a bit ago. I had also renamed some directories along the way, so when I tried to separate them up with git-filter-branch, it was failing with collecting all the revisions needed.

I ended up writing two helper scripts to solve my problem, and I wish I hadn't glossed over --no-hardlinks because that would have saved quite a bit of time. On the other hand, I found out by cloning the git repo twice (and doing git gc) has the same effect as --no-hardlinks. At any rate, these two scripts worked well enough do what I needed to do. I'll most likely change them in the future to integrate --no-hardlinks.