Version Control

A version control system (or revision control system) is a combination of technologies and practices for tracking and controlling changes to a project's files, in particular to source code, documentation, and web pages. If you have never used version control before, the first thing you should do is go find someone who has, and get them to join your project. These days, everyone will expect at least your project's source code to be under version control, and probably will not take the project seriously if it doesn't use version control with at least minimal competence.

The reason version control is so universal is that it helps with virtually every aspect of running a project: inter-developer communications, release management, bug management, code stability and experimental development efforts, and attribution and authorization of changes by particular developers. The version control system provides a central coordinating force among all of these areas. The core of version control is change management: identifying each discrete change made to the project's files, annotating each change with metadata like the change's date and author, and then replaying these facts to whoever asks, in whatever way they ask. It is a communications mechanism where a change is the basic unit of information.

This section does not discuss all aspects of using a version control system. It's so all-encompassing that it must be addressed topically throughout the book. Here, we will concentrate on choosing and setting up a version control system in a way that will foster cooperative development down the road.

Version Control Vocabulary

This book cannot teach you how to use version control if you've never used it before, but it would be impossible to discuss the subject without a few key terms. These terms are useful independently of any particular version control system: they are the basic nouns and verbs of networked collaboration, and will be used generically throughout the rest of this book. Even if there were no version control systems in the world, the problem of change management would remain, and these words give us a language for talking about that problem concisely.

If you're comfortably experienced with version control already, you can probably skip this section. If you're not sure, then read through this section at least once. Certain version control terms have gradually changed in meaning since the early 2000s, and you may occasionally find people using them in incompatible ways in the same conversation. Being able to detect that phenomenon early in a discussion can often be helpful.

commit

To make a change to the project; more formally, to store a change in the version control database in such a way that it can be incorporated into future releases of the project. "Commit" can be used as a verb or a noun. For example: "I just committed a fix for the server crash bug people have been reporting on Mac OS X. Jay, could you please review the commit and check that I'm not misusing the allocator there?"

push

To publish a commit to a publicly online repository, from which others can incorporate it into their copy of the project's code. When one says one has pushed a commit, the destination repository is usually implied. Often it is the project's master repository, the one from which public releases are made, but not always.

Note that in some version control systems (e.g., Subversion), commits are automatically and unavoidably pushed up to a predetermined central repository, while in others (e.g., Git, Mercurial) the developer chooses when and where to push commits. Because the former types privilege a particular central repository, they are known as "centralized" version control systems, while the latter are known as "decentralized". In general, decentralized systems are the modern trend, especially for open source projects, which benefit from the peer-to-peer relationship between developers' repositories.

pull or update

To pull others' changes (commits) into your local copy of the project. When pulling changes from a project's mainline development branch (see branch), people often say "update" instead of "pull", for example: "Hey, I noticed the indexing code is always dropping the last byte. Is this a new bug?" "Yes, but it was fixed last week—try updating and it should go away."

commit message or log message

A bit of commentary attached to each commit, describing the nature and purpose of the commit (both terms are used about equally often; I'll use them interchangeably in this book). Log messages are among the most important documents in any project: they are the bridge between the detailed, highly technical meaning of each individual code changes and the more user-visible world of bugfixes, features and project progress. Later in this section, we'll look at ways to distribute them to the appropriate audiences; also, the section called “Codifying Tradition” in Chapter 6, Communications discusses ways to encourage contributors to write concise and useful commit messages.

repository

A database in which changes are stored and from which they are published. In centralized version control systems, there is a single, master repository, which stores all changes to the project, and each developer works with a kind of latest summary on her own machine. In decentralized systems, each developer has her own repository, changes can be swapped back and forth between repositories arbitrarily, and the question of which repository is the "master" (that is, the one from which public releases are rolled) is defined purely by social convention, instead of by a combination of social convention and technical enforcement.

clone (see also checkout)

To obtain one's own development repository by making a copy of the project's central repository.

checkout

When used in discussion, "checkout" usually means something like "clone", except that centralized systems don't really clone the full repository, they just obtain a working copy or working files. When decentralized systems use the word "checkout", they also mean the process of obtaining working files from a repository, but since the repository is local in that case, the user experience is quite different because the network is not involved.

In the centralized sense, a checkout produces a directory tree called a "working copy" (see below), from which changes may be sent back to the original repository.

working copy or working files

A developer's private directory tree containing the project's source code files, and possibly its web pages or other documents, in a form that allows the developer to edit them. A working copy also contains some version control metadata saying what repository it comes from, what branch it represents, and a few other things. Typically, each developer has her own working copy, from which she edits, tests, commits, pulls, pushes, etc.

In decentralized systems, working copies and repositories are usually colocated anyway, so the term "working copy" is less often used. Developers instead tend to say "my clone" or "my copy" or sometimes "my fork".

revision, change, changeset, or (again) commit

A "revision" is a precisely specified incarnation of the project at a point in time, or of a particular file or directory in the project. These days, most systems also use "revision", "change", "changeset", or "commit" to refer to a set of changes committed together as one conceptual unit, if multiple files were involved, though colloquially most people would refer to changeset 12's effect on file F as "revision 12 of F".

These terms occasionally have distinct technical meanings in different version control systems, but the general idea is always the same: they give a way to speak precisely about exact points in time in the history of a file or a set of files (say, immediately before and after a bug is fixed). For example: "Oh yes, she fixed that in revision 10" or "She fixed that in commit fa458b1fac".

When one talks about a file or collection of files without specifying a particular revision, it is generally assumed that one means the most recent revision(s) available.

diff

A textual representation of a change. A diff shows which lines were changed and how, plus a few lines of surrounding context on either side. A developer who is already familiar with some code can usually read a diff against that code and understand what the change did, and often even spot bugs.

tag or snapshot

A label for a particular state of the project at a point in time. Tags are generally used to mark interesting snapshots of the project. For example, a tag is usually made for each public release, so that one can obtain, directly from the version control system, the exact set of files/revisions comprising that release. Tag names are often things like Release_1_0, Delivery_20130630, etc.

branch

A copy of the project, under version control but isolated so that changes made to the branch don't affect other branches of the project, and vice versa, except when changes are deliberately "merged" from one branch to another (see below). Branches are also known as "lines of development". Even when a project has no explicit branches, development is still considered to be happening on the "main branch", also known as the "main line" or "trunk" or "master".

Branches offer a way to keep different lines of development from interfering with each other. For example, a branch can be used for experimental development that would be too destabilizing for the main trunk. Or conversely, a branch can be used as a place to stabilize a new release. During the release process, regular development would continue uninterrupted in the main branch of the repository; meanwhile, on the release branch, no changes are allowed except those approved by the release managers. This way, making a release needn't interfere with ongoing development work. See the section called “Use branches to avoid bottlenecks” later in this chapter for a more detailed discussion of branching.

merge or port

To move a change from one branch to another. This includes merging from the main trunk to some other branch, or vice versa. In fact, those are the most common kinds of merges; it is less common to port a change between two non-trunk branches. See the section called “Singularity of information” for more on change porting.

"Merge" has a second, related meaning: it is what some version control systems do when they see that two people have changed the same file but in non-overlapping ways. Since the two changes do not interfere with each other, when one of the people updates their copy of the file (already containing their own changes), the other person's changes will be automatically merged in. This is very common, especially on projects where multiple people are hacking on the same code. When two different changes do overlap, the result is a "conflict"; see below.

conflict

What happens when two people try to make different changes to the same place in the code. All version control systems automatically detect conflicts, and notify at least one of the humans involved that their changes conflict with someone else's. It is then up to that human to resolve the conflict, and to communicate that resolution to the version control system.

lock

A way to declare an exclusive intent to change a particular file or directory. For example, "I can't commit any changes to the web pages right now. It seems Alfred has them all locked while he fixes their background images." Not all version control systems even offer the ability to lock, and of those that do, not all require the locking feature to be used. This is because parallel, simultaneous development is the norm, and locking people out of files is (usually) contrary to this ideal.

Version control systems that require locking to make commits are said to use the lock-modify-unlock model. Those that do not are said to use the copy-modify-merge model. An excellent in-depth explanation and comparison of the two models may be found at svnbook.red-bean.com/nightly/en/svn.basic.version-control-basics.html#svn.basic.vsn-models. In general, the copy-modify-merge model is better for open source development, and all the version control systems discussed in this book support that model.

Choosing a Version Control System

If you don't already have a strong opinion about which version control system your project should use, then choose Git (git-scm.com), and host your project's repositories at GitHub.com, which offers unlimited free hosting for open source projects.

Git is by now the de facto standard in the open source world, as is hosting one's repositories at GitHub. Because so many developers are already comfortable with that combination, choosing it sends the signal that your project is ready for participants. But Git-at-GitHub is not the only viable combination. Two other reasonable choices of version control system are Mercurial and Subversion. Mercurial and Git are both decentralized systems, whereas Subversion is centralized. All three are offered at many different free hosting services; some services even support more than one of them (though GitHub only supports Git, as its name suggests). While some projects host their repositories on their own servers, most just put their repositories on one of the free hosting services, as described in Appendix A, Canned Hosting Sites.

There isn't space here for an in-depth exploration of why you might choose something other than Git. If you have a reason to do so, then you already know what that reason is. If you don't, then just use Git (and probably on GitHub). If you find yourself using something other than Git, Mercurial, or Subversion, ask yourself why — because whatever that other version control system is, most other developers won't be familiar with it, and it likely has a smaller and less stable community of support around it than the big three do.

Using the Version Control System

The recommendations in this section are not targeted toward a particular version control system, and should be implementable in any of them. Consult your specific system's documentation for details.

Version everything

Keep not only your project's source code under version control, but also its web pages, documentation, FAQ, design notes, and anything else that people might want to edit. Keep them right with the source code, in the same repository tree. Any piece of information worth writing down is worth versioning—that is, any piece of information that could change. Things that don't change should be archived, not versioned. For example, an email, once posted, does not change; therefore, versioning it wouldn't make sense (unless it becomes part of some larger, evolving document).

The reason to version everything together in one place is so that people only have to learn one mechanism for submitting changes. Often a contributor will start out making edits to the web pages or documentation, and move to small code contributions later, for example. When the project uses the same system for all kinds of submissions, people only have to learn the ropes once. Versioning everything together also means that new features can be committed together with their documentation updates, that branching the code will branch the documentation too, etc.

Don't keep generated files under version control. They are not truly editable data, since they are produced programmatically from other files. For example, some build systems create a file named configure based on a template in configure.in. To make a change to the configure, one would edit configure.in and then regenerate; thus, only the template configure.in is an "editable file." Just version the templates—if you version the generated files as well, people will inevitably forget to regenerate them when they commit a change to a template, and the resulting inconsistencies will cause no end of confusion.

There are technical exceptions to the rule that all editable data should be kept in the same version control system as the code. For example, a project's bug tracker and its wiki hold plenty of editable data, but usually do not store that data in the main version control system[30]. However, they should still have versioning systems of their own, e.g., the comment history in a bug ticket, and the ability to browse past revisions and view differences between them in a wiki.

Browsability

The project's repository should be browsable on the Web. This means not only the ability to see the latest revisions of the project's files, but to go back in time and look at earlier revisions, view the differences between revisions, read log messages for selected changes, etc.

Browsability is important because it is a lightweight portal to project data. If the repository cannot be viewed through a web browser, then someone wanting to inspect a particular file (say, to see if a certain bugfix had made it into the code) would first have to install version control client software locally, which could turn their simple query from a two-minute task into a half-hour or longer task.

Browsability also implies canonical URLs for viewing a particular change (i.e., a commit), and for viewing the latest revision at any given time without specifying its commit identifier. This can be very useful in technical discussions or when pointing people to documentation. For example, instead of saying "For bug management guidelines, see the community-guide/index.html file in your working copy," one can say "For bug management guidelines, see http://subversion.apache.org/docs/community-guide/," giving a URL that always points to the latest revision of the community-guides/index.html file. The URL is better because it is completely unambiguous, and avoids the question of whether the addressee has an up-to-date working copy.

Some version control systems come with built-in repository-browsing mechanisms, and in any case most hosting sites offer a good web interface. But if you need to install a third-party tool for repository browsing, there are many out there. Three that support Git are GitLab (gitlab.org), GitWeb (git.wiki.kernel.org/index.php/Gitweb), and GitList (gitlist.org). For Subversion, there is ViewVC (viewvc.org). A web search will turn up plenty of others besides these.

Use branches to avoid bottlenecks

Non-expert version control users are sometimes a bit afraid of branching and merging. If you are among those people, resolve right now to conquer any fears you may have and take the time to learn how to do branching and merging. They are not difficult operations, once you get used to them, and they become increasingly important as a project acquires more developers.

Branches are valuable because they turn a scarce resource—working room in the project's code—into an abundant one. Normally, all developers work together in the same sandbox, constructing the same castle. When someone wants to add a new drawbridge, but can't convince everyone else that it would be an improvement, branching makes it possible for her to make a copy of the castle, take it off to an isolated corner, and try out the new drawbridge design. If the effort succeeds, she can invite the other developers to examine the result (in GitHub-speak, this invitation is known as a "pull request" — see the section called “Pull requests”). If everyone agrees that the result is good, she or someone else can tell the version control system to move ("merge") the drawbridge from the branch version of the castle over to the main version, sometimes called the master branch.

It's easy to see how this ability helps collaborative development. People need the freedom to try new things without feeling like they're interfering with others' work. Equally importantly, there are times when code needs to be isolated from the usual development churn, in order to get a bug fixed or a release stabilized (see the section called “Stabilizing a Release” and the section called “Maintaining Multiple Release Lines” in Chapter 7, Packaging, Releasing, and Daily Development) without worrying about tracking a moving target. At the same time, people need to be able to review and comment on experimental work, whether it's happening in the master branch or somewhere else. Treating branches as first-class, publishable objects makes all this possible.

Use branches liberally, and encourage others to use them. But also make sure that a given branch is only active for as long as needed. Every active branch is a slight drain on the community's attention. Even those who are not working in a branch still maintain a peripheral awareness of what's going on in it. Such awareness is desirable, of course, and commit notices should be sent out for branch commits just as for any other commit. But branches should not become a mechanism for dividing the development community. With rare exceptions, the eventual goal of most branches should be to merge their changes back into the main line and disappear.

Singularity of information

Merging has an important corollary: never commit the same change twice. That is, a given change should enter the version control system exactly once. The revision (or set of revisions) in which the change entered is its unique identifier from then on. If it needs to be applied to branches other than the one on which it entered, then it should be merged from its original entry point to those other destinations—as opposed to committing a textually identical change, which would have the same effect in the code, but would make accurate bookkeeping and release management much harder.

The practical effects of this advice differ from one version control system to another. In some systems, merges are special events, fundamentally distinct from commits, and carry their own metadata with them. In others, the results of merges are committed the same way other changes are committed, so the primary means of distinguishing a "merge commit" from a "new change commit" is in the log message. In a merge's log message, don't repeat the log message of the original change. Instead, just indicate that this is a merge, and give the identifying revision of the original change, with at most a one-sentence summary of its effect. If someone wants to see the full log message, she should consult the original revision.

One reason it's important to avoid repeating the log message is that, in some systems, log messages are sometimes edited after they've been committed. If a change's log message were repeated at each merge destination, then even if someone edited the original message, she'd still leave all the repeats uncorrected—which would only cause confusion down the road. Another reason is that non-duplication makes it easier to be sure when one has tracked down the original source of a change. When you're looking at a complete log message that doesn't refer to a some other merge source, you can know that it must be the original change, and handle it accordingly.

The same principle applies to reverting a change. If a change is withdrawn from the code, then the log message for the reversion should merely state that some specific revision(s) is being reverted, not describe the actual code change that results from the reversion, since the semantics of the change can be derived by reading the original log message and change. Of course, the reversion's log message should also state the reason why the change is being reverted, but it should not duplicate anything from the original change's log message. If possible, go back and edit the original change's log message to point out that it was reverted.

All of the above implies that you should use a consistent syntax for referring to changes. This is helpful not only in log messages, but in emails, the bug tracker, and elsewhere. In Git and Mercurial, the syntax is usually "commit bb2377" (where the commit hash code on the right is long enough to be unique in the relevant context); in Subversion, revision numbers are linearly incremented integers and the standard syntax for, say, revision 1729 is "r1729". In other systems, there is usually a standard syntax for expressing the changeset name. Whatever the appropriate syntax is for your system, encourage people to use it when referring to changes. Consistent expression of change names makes project bookkeeping much easier (as we will see in Chapter 6, Communications and Chapter 7, Packaging, Releasing, and Daily Development), and since a lot of the bookkeeping may be done by volunteers, it needs to be as easy as possible.

See also the section called “Releases and Daily Development” in Chapter 7, Packaging, Releasing, and Daily Development.

Authorization

Many version control systems offer a feature whereby certain people can be allowed or disallowed from committing in specific sub-areas of the master repository. Following the principle that when handed a hammer, people start looking around for nails, many projects use this feature with abandon, carefully granting people access to just those areas where they have been approved to commit, and making sure they can't commit anywhere else. (See the section called “Committers” in Chapter 8, Managing Volunteers for how projects decide who can put changes where.)

Exercising such tight control is usually unnecessary, and may even be harmful. Some projects simply use an honor system: when a person is granted commit access, even for a sub-area of the project, what they actually receive is the ability to commit anywhere in the master repository. They're just asked to keep their commits in their area. Remember that there is little real risk here: the repository provides an audit trail, and in an active project, all commits are reviewed anyway. If someone commits where they're not supposed to, others will notice it and say something. If a change needs to be undone, that's simple enough—everything's under version control anyway, so just revert.

There are several advantages to this more relaxed approach. First, as developers expand into other areas (which they usually will if they stay with the project), there is no administrative overhead to granting them wider privileges. Once the decision is made, the person can just start committing in the new area right away.

Second, expansion can be done in a more fine-grained manner. Generally, a committer in area X who wants to expand to area Y will start posting patches against Y and asking for review. If someone who already has commit access to area Y sees such a patch and approves of it, she can just tell the submitter to commit the change directly (mentioning the reviewer/approver's name in the log message, of course). That way, the commit will come from the person who actually wrote the change, which is preferable from both an information management standpoint and from a crediting standpoint.

Last, and perhaps most important, using the honor system encourages an atmosphere of trust and mutual respect. Giving someone commit access to a subdomain is a statement about their technical preparedness—it says: "We see you have expertise to make commits in a certain domain, so go for it." But imposing strict authorization controls says: "Not only are we asserting a limit on your expertise, we're also a bit suspicious about your intentions." That's not the sort of statement you want to make if you can avoid it. Bringing someone into the project as a committer is an opportunity to initiate them into a circle of mutual trust. A good way to do that is to give them more power than they're supposed to use, then inform them that it's up to them to stay within the stated limits.

The Subversion project has operated on this honor system way or well over a decade, with more than 40 full committers and many more partial committers as of this writing. The only distinction the system actually enforces is between committers and non-committers; further subdivisions are maintained solely by human judgement. Yet the project never had a serious problem with someone deliberately committing outside their domain. Once or twice there's been an innocent misunderstanding about the extent of someone's commit privileges, but it's always been resolved quickly and amiably.

Obviously, in situations where self-policing is impractical, you must rely on hard authorization controls. But such situations are rare. Even when there are millions of lines of code and hundreds or thousands of developers, a commit to any given code module should still be reviewed by those who work on that module, and they can recognize if someone committed there who wasn't supposed to. If regular commit review isn't happening, then the project has bigger problems to deal with than the authorization system anyway.

In summary, don't spend too much time fiddling with the version control authorization system, unless you have a specific reason to. It usually won't bring much tangible benefit, and there are advantages to relying on human controls instead.

None of this should be taken to mean that the restrictions themselves are unimportant, of course. It would be bad for a project to encourage people to commit in areas where they're not qualified. Furthermore, in many projects, full (unrestricted) commit access has a special corollary status: it implies voting rights on project-wide questions. This political aspect of commit access is discussed more in the section called “Who Votes?” in Chapter 4, Social and Political Infrastructure.

Tools for commit review

9 September 2013: If you're reading this note, then you've encountered this section while it's in the process of being written as part of the overall update of this book (see producingoss.com/v2.html).

poss2 todo: there are three main things to cover here: Gerrit and similar tools, the GitHub PR model, and commit emails. Intro paragraph should give an overview and describe how they interact, then a short section on each. The section for commit emails is already done as it was just moved here from its old home as a subsection of the "Using the Version Control System" section. Discuss how human-centered commit review can be linked with automated buildbots that may or may not be a hard gateway to the central repository.

Pull requests

TBD (Gerrit et al)

Commit emails

Every commit to the repository should generate an email showing who made the change, when they made it, what files and directories changed, and how they changed. The email should go to a special mailing list devoted to commit emails, separate from the mailing lists to which humans post. Developers and other interested parties should be encouraged to subscribe to the commits list, as it is the most effective way to keep up with what's happening in the project at the code level. Aside from the obvious technical benefits of peer review (see the section called “Practice Conspicuous Code Review”), commit emails help create a sense of community, because they establish a shared environment in which people can react to events (commits) that they know are visible to others as well.

The specifics of setting up commit emails will vary depending on your version control system, but usually there's a script or other packaged facility for doing it. If you're having trouble finding it, try looking for documentation on hooks (or sometimes triggers) specifically a post-commit hook hook. Post-commit hooks are a general means of launching automated tasks in response to commits. The hook is triggered when a given commit finalizes, is fed all the information about that commit, and is then free to use that information to do anything—for example, to send out an email.

With pre-packaged commit email systems, you may want to modify some of the default behaviors:

  1. Some commit mailers don't include the actual diffs in the email, but instead provide a URL to view the change on the web using the repository browsing system. While it's good to provide the URL, so the change can be referred to later, it is also important that the commit email include the diffs themselves. Reading email is already part of people's routine, so if the content of the change is visible right there in the commit email, developers will review the commit on the spot, without leaving their mail reader. If they have to click on a URL to review the change, most won't do it, because that requires a new action instead of a continuation of what they were already doing. Furthermore, if the reviewer wants to ask something about the change, it's vastly easier to hit reply-with-text and simply annotate the quoted diff than it is to visit a web page and laboriously cut-and-paste parts of the diff from web browser to email client.

    (Of course, if the diff is huge, such as when a large body of new code has been added to the repository, then it makes sense to omit the diff and offer only the URL. Most commit mailers can do this kind of size-limiting automatically. If yours can't, then it's still better to include diffs, and live with the occasional huge email, than to leave the diffs off entirely. Convenient reviewing and commenting is a cornerstone of cooperative development, and much too important to do without.)

  2. The commit emails should set their Reply-to header to the regular development list, not the commit email list. That is, when someone reviews a commit and writes a response, their response should be automatically directed toward the human development list, where technical issues are normally discussed. There are a few reasons for this. First, you want to keep all technical discussion on one list, because that's where people expect it to happen, and because that way there's only one archive to search. Second, there might be interested parties not subscribed to the commit email list. Third, the commit email list advertises itself as a service for watching commits, not for watching commits and having occasional technical discussions. Those who subscribed to the commit email list did not sign up for anything but commit emails; sending them other material via that list would violate an implicit contract.

    Note that this advice to set Reply-to does not contradict the recommendations in the section called “The Great Reply-to Debate” earlier in this chapter. It's always okay for the sender of a message to set Reply-to. In this case, the sender is the version control system itself, and it sets Reply-to in order to indicate that the appropriate place for replies is the development mailing list, not the commit list.



[30] There are development environments that integrate everything into one unified version control world; see Veracity for an example.