The Evolution of Social Coding with Git

This is a rough set of ideas about how open source software has changed in the last decade or so, and how it's affected our ability to work with the tools Git provides. It's not based on solid research or anything useful like that. Just complete and total speculation on my part. Feel free to flip tables, disagree at me on Twitter (@emmajanehw), and write your own response piece about how wrong I am.

From Patching Work Flow to Social Coding

When I first started contributing to open source projects (a little over a decade ago) there were two things I needed to learn when it came to version control: CVS, and reading a patch file which was attached to an email. Most of the projects I worked on used CVS, and as a result I taught my students how to use CVS as well. Most of the open source projects I followed, however, used a "patching" work flow. Indeed, Drupal still works this way.

As a result of the patching work flow, I became very comfortable reading diff files (and as a result, patches). It wasn't a lot of fun, but it was tedious work. (If you're not at all familiar with these tools, you can peek at Introduction to Using Diff and Patch for a decent overview of what was involved.) There are even tracking systems designed to work in conjunction with mailing list to capture patches sent to a mailing list and publish them to a web interface.

In this work flow, it makes the most sense to submit your idea when you think you're close to a solution (or perhaps you're stuck and need help). The granularity of what you submit would not include the micro steps you took to get to this point; rather, it would simply include the completed changes you made against the original version of the files you were working from. To support this work flow, there is even a git command, am which splits email messages in a given mailbox into a commit message, authorship information, and the patch file itself. (You can read the man page for the command to get a better sense of the types of information it can process.) Indeed the creator of Git, Linus Torvalds, is notoriously picky about his commit messages.

Some modern projects still use a patching work flow. Yes, dealing with patches is still tedious work as the patches need to be kept up-to-date by re-creating the files. (There is a very long guide on how to do this. Skim it to see how annoying this process is; don't worry if you don't understand all the steps involved as it's not important for this particular rant.) By using a patching work flow, however, the developers are still working with whole ideas when they apply changes to their software.

If we take a look at Git itself, many of the tools it gives us are optimized for dealing with whole ideas, and also creating whole ideas.

  • Bisect: A binary search of past commits which allows you to determine when a bug was introduced. Git temporarily reverts your code by individual commits and allows you to mark a commit as "good" or "bad". To be effective, every commit you're testing needs to have a working version of your code. If, for example, you have a commit which is "stopped working for the night" and the code is not actually in a functional state, you end up wasting time finding the useful commits where the code did actually work. A great example of how bisect works in practice is available here.
  • Rebase: Apply commits to a branch as if they happened in a different order than what the history suggests. This feature also allows you to combine (or separate) individual commits. By combining (or "squashing") multiple commits into one, you can come closer to the idea of a "whole idea" which I discussed previously. Rebasing is essentially re-drawing the graph of how a history of commits was performed. (Conceptually I hate rebasing. It feels like revisionist history. I also understand why it is a necessary tool and do use it when it is appropriate to do so.)

So we have these two concepts that are optimized for working with whole ideas. If we look at social coding today though, the experience is actually optimized on the conversation it takes to get to a whole idea. If you take a look at the GitHub Flow Guide you can see that the experience is optimized for a conversation including micro adjustments of code along the way. At the end of the conversation the branch where the conversation took place is merged into the main branch. This leaves a commit record which represents the development of the conversation. For all kinds of reasons, I think the history of the development of an idea is fascinating. I support this work flow.

Contrast this conversation-driven work flow to the patching work flow I described initially. In the patching work flow, only whole ideas (or the final patch) are committed to the master branch. The thinking process is not included in the commit history (although it might be included in the commit message).

Why does this matter? It matters because Git is optimized to work with individual commits.

Working With Whole Ideas

Compare the graphical log for Git and another distributed version control project, Bazaar. By default Bazaar shows only the final commit from a given branch. You need to click the branch to see the commits (or the conversation) which happened along the way.

bazaar explorer shows branches collapsed by

In other words: the way Bazaar displays its commit history is optimized for the forking (and branching) work flows that GitHub and other social coding platforms promote. (Don't worry, this isn't going to turn into a Bazaar is Better Than Git article. I just wanted you to see an alternative way of perceiving the information.)

To contrast this, take a look at how Git displays commit history. Instead of showing groups of ideas, Git shows the individual commits along the entire length of a branch.

gitk and git log --graph output
focus on individual commits

Not only that, but look at those useless default commit messages in Git where a merged happened ("Merge pull request issue_number from username/branchname"). It doesn't tell me anything about what the branch was about! For that I have to go back and read the commit messages in every commit up to that point. Beginning from the first commit for the branch where the idea was first introduced. If you're the person trying to understand why a branch was merged in by using the information stored in Git, good luck to you! Typically the best way to obtain that information is now the external ticketing system, not the code repository itself.

Once you understand how Git was used by its original creators, and for what experience it has optimized itself, you are left with two options: adopt a ticketing system and use its built-in features to track your changes to the code base over time; or insist your team adopt a rebasing work flow and attempt to re-create the patching work flow which presents whole ideas, not conversations, in the commit history. If you choose the first option, you are essentially throwing out any future hope of using Git's built-in tools to do anything useful with your commit history. You are also, essentially, committing yourself to vendor lock-in for your ticketing system because the legible version of history will be captured in the ticket system, not in the commit messages themselves.

Over time this will allow (or force?) Git to stagnate as a product, increasing our dependence on individual (often proprietary) systems to hold our institutional / code history. This is scary because these systems tend to be built around companies, and sometimes companies don't last forever.

Woah. That was kind of alarmist, wasn't it.

Making Git Social

So where do we go from here? We begin, ironically, with a conversation about what's important to us. I don't have any perfect answers for you, but hopefully I have some thought-provoking ideas.

  1. Determine, with your team, if it is more important to have an easy-to-use set of Git commands to allow the widest possible group of people to participate; or if it is more important to be able to move your project to a new code hosting platform if your current one disappears tomorrow. If ease-of-use trumps portability, by all means continue to work in whatever commit granularity is easiest for the contributor. But if it's more important to be able to migrate the history of your code to a new platform, consider adopting (and enforcing) a rebasing culture where each commit is as close as possible to a whole idea.
  2. If it's important to you that we continue to make coding social, and if ease-of-participation is important, consider the limitations of the defacto source control system we use. Git is open source. There are loads of fantastic graphical interfaces which have been built on top of Git, but they're all still limited to the same original concept that a commit is a whole idea. I've spent time looking, but I haven't found tools which do a great job of visualizing the whole ideas of a branch. We can, and should, build better tools which allow us to track changes to our code in a "modern" fashion and without having to rely on third-party systems.
  3. Currently, the default commit message for a merge sucks if you're using a social coding work flow (branching or forking as opposed to patching). There's no reason why we can't make Git better at dealing with how we want to work with it. I don't know what the default message for a merge should be. A duplicate of the last commit message in the branch that's being merged (potential problem: what about a three-way merge)? Maybe you like the final commit message as is and it works for your team...or maybe you've never even really thought about it because you have a GUI which sits on top of Git.

I'm sure there are other questions as well. But this is where I'm at right now. I don't really mean to seem like an alarmist. I just think it's an interesting pickle that we've gotten ourselves into. The way we code has become very social. It's centred around conversations and in many cases it's based around a Web interface. I don't necessarily think that's a bad thing, but I'm old enough to remember when SourceForge was still cool (sorry SourceForge!). I remember seeing the wave of projects move towards self-hosting and now the wave back to a centralized system. I think we're losing the benefits of a distributed version control system by becoming so attached to our centralized systems; but I think we're there because Git is so damned opaque. ("Git is easy to learn! It's just a directed acyclic graph of your commits!" Seriously? Fuck that.)

I want us to do better. We need to do better. And if we can't do better, I at least want it to be a conscious choice to have limited ourselves to the participation models that were created in 2005 by a kernel developer whose relationship with a proprietary software product went sour and whose community license was revoked. When you think about it, you know that we absolutely must do better and that we cannot allow ourselves to be complacent when it comes to our livelihoods and participation in the Web development, and open source software communities. It's really just a question of where we need to start so that we can fix the pickle we seem to have gotten ourselves into.