Large Git repositories

Benjamin's Personal Blog - Sun, 15/12/2013 - 7:48pm
A little while back, Facebook started a discussion on the Git mailing list about performance issues when using a large repository. The devs were a little vague on specifics, but it was pretty clear that the problem stems from the fact that every Facebook project currently lives in one repository and all of the projects are interdependent. Maybe, just maybe, Facebook has a real reason for keeping all of its projects together in one repository, but my money is on a historical accident: the code was never cleanly separated, so right now they have no choice but to try to stuff it all into one Git repo. Like many before them, they would rather improve Git to scale to their requirements than split their repository.

Over the past few years I have hacked on Git and Git servers quite a bit and ended up discussing this very problem with a number of different companies. I even wrote a little Git command called git-map that executes a Git command in multiple Git directories at the same time. In almost every case similar to Facebook's, the company had built a mountain of hairy, interdependent projects and code. From a source control standpoint this worked fine as long as everything stayed on the server (no big downloads) and the tool was something like Perforce, where every file is read-only and you have to mark a file for editing before you modify it (no long stat of every file).
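
To give a feel for the idea, here is a minimal sketch of what a git-map style helper could look like. It is not the actual git-map source; the function names (`find_git_repos`, `run_in_repo`, `git_map`) and the thread pool are my own assumptions for illustration.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def find_git_repos(root):
    """Return sorted paths of directories under root that contain a .git entry."""
    repos = []
    for dirpath, dirnames, filenames in os.walk(root):
        # .git is a directory in a normal clone and a file in a linked worktree.
        if ".git" in dirnames or ".git" in filenames:
            repos.append(dirpath)
            dirnames[:] = []  # do not descend into the repository itself
    return sorted(repos)


def run_in_repo(repo, git_args):
    """Run `git <git_args>` inside repo; return (repo, exit code, combined output)."""
    result = subprocess.run(
        ["git", *git_args],
        cwd=repo,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return repo, result.returncode, result.stdout


def git_map(root, git_args, workers=8):
    """Run the same git command in every repository under root, concurrently."""
    repos = find_git_repos(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda repo: run_in_repo(repo, git_args), repos))
```

Calling `git_map(".", ["status", "--short"])` would then return one `(repo, exit code, output)` tuple per repository found under the current directory, assuming `git` is on the PATH.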

But importing this type of setup into Git results in repositories that can easily be tens of GB and commands that take minutes to execute. The various companies all told me that the big hairy ball was good for them and that splitting it up would be annoying because, other than enabling Git, it offered no real reward. But upon investigation I typically find the repository to be exactly what it sounds like, a big hairy ball, and even if they never switched to Git, cleaning up the repository would be a big improvement for development.

The big hairy repo problem can result in all of these problems (and more!):
  • Usually devs can't build only their little project, but have to build the entire world. This increases the "code, build, test" cycle from seconds to maybe hours. Every once in a while someone comes along and speeds up the build by buying some fast distcc servers or upgrading everyone's desktop with SSDs, but little by little it slows down, and X months later it is back to "slow".
  • When every developer in a company is committing to the same branch, the odds that a commit will break the build increase as more developers are hired.
  • Because the branch fails to build, devs end up committing without building, or fail to sync for weeks at a time once they find a build that works. (If they are lucky, a bot is put in place that builds and commits for the devs, which is the only solution I know of to the make world problem.)
  • Because the API between projects can change at any time, developers create APIs without much thought, typically resulting in bad APIs. Bad APIs are later changed (without warning), breaking any code that wasn't updated with the API change. The lack of an incentive for a stable API makes it much harder to release new projects on old code bases and vice versa.
  • Bisecting the repo for bugs is so difficult that devs do it only as a last resort, which costs time.
  • Internal API promises get broken in the rush to release. Project A happens to include the private header so it can muck with project B's internals to get what it needs. A "TODO" comment is added, but it is never fixed. Imagine the worst thing you can do in your language of choice... somewhere someone is doing that right now.
  • Unlike in Git, making branches is usually very hard (a detail of the software typically used today, Perforce and svn), and so almost no one does. This means there are big efforts before a release to "make things stable again". The overall health of a project is hard to determine and release date planning becomes harder.
  • There are big "build teams" that release the whole software package. The idea that your project could pick which change gets into a release is laughable. Because branching and merging are hard, minor fixes get into release branches and increase the regression count.
An odd thing in common with all of these issues is that they all cost developers time, which is their most valuable commodity. You would think developers would be all over trying to solve them, but it just isn't a priority.
But what about the atomic problem? In Git repo A I add a new API. In Git repo B I add code that uses the new API. Users that sync B but not A will have a build failure. The obvious and long-solved answer for library APIs is versions. If none of your internal projects have versions, using lots of little Git repos can be a pain. When you split the internal projects into different repositories, it is best to do the following:
  1. Teams give their project versions and tag it whenever it is in a releasable state.
  2. The build team's actual job is to create a distribution of the various project releases.
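
To make the A and B example concrete, here is a rough sketch of the check a build team's tooling might run before publishing a distribution. Everything in it is hypothetical: the project names, the version numbers, the dotted-integer version scheme, and the `missing_requirements` helper are assumptions for illustration, not any existing tool.

```python
def parse_version(text):
    """Turn a dotted version such as '1.10.0' into a comparable tuple (1, 10, 0)."""
    return tuple(int(part) for part in text.split("."))


def missing_requirements(distribution, requirements):
    """List (project, dependency, minimum, found) for every unmet requirement.

    distribution maps project name -> tagged release version included in it.
    requirements maps project name -> {dependency name: minimum version}.
    """
    problems = []
    for project, needs in requirements.items():
        for dependency, minimum in needs.items():
            found = distribution.get(dependency)
            if found is None or parse_version(found) < parse_version(minimum):
                problems.append((project, dependency, minimum, found))
    return problems


# Hypothetical data: B uses an API that A only added in release 1.2.0.
distribution = {"A": "1.1.0", "B": "2.0.0"}
requirements = {"B": {"A": "1.2.0"}}
for project, dependency, minimum, found in missing_requirements(distribution, requirements):
    print(f"{project} needs {dependency} >= {minimum}, distribution has {found}")
```

With versions declared like this, the "synced B but not A" failure shows up as a clear dependency error at packaging time rather than as a mysterious build break on someone's desktop.
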
Lucky for us, the Linux community has been doing this for the past two decades, and there is a long line of historical lessons for us to use. Projects like Debian show how to have source and binary packages, dependency tracking, versioning, importing from git, svn, or tarballs, patching third-party source, and packaging it all into one distribution. And they toss in for free the ability to upgrade and downgrade packages on the fly. And don't forget build and package servers, which are also there for free and open source.

Sadly, all of the work done by Debian and others seems to be ignored in favor of submodules or a one-off solution like repo. And who is to blame them, when I too can write a script that loops over every git repo running "./configure && make install"? What could go wrong? On the flip side, Debian isn't exactly light, so there might be an opportunity for a project to come along that is Debian-like but easy to incorporate into everyday software development.