Converting from Subversion to Mercurial

As I said in my last entry, I've been evaluating the various modern DVCSes to figure out which of them would give me the most benefit while irritating me the least.

I've been using Subversion (SVN) for a few years now on my dev servers (formerly, svn.samhart.net and friends) and have mostly been pleased with it. In fact, the only reason I even considered replacing SVN was that there were certain aspects of DVCS that I felt could make my life easier, namely the ability to have a repo's entire history available locally and the fact that offline work is so much easier with them.

Additionally, I've been working with a lot of modern DVCSes lately (namely bzr, git and svk) and I've been very displeased by each of them. They all had at least one critical problem that, for me, made them impractical to even consider for use in my own repos. The end result is that I've spent a lot of time frustratingly researching and testing as many DVCSes as I could to try and figure out if I should switch or just stick with SVN.

But, after the smoke cleared and the fires died down, I discovered that one DVCS, Mercurial (Hg), was left standing on equal ground with SVN in the "has to not irritate me" department.

The problem? Conversion from SVN to Hg isn't as straightforward as one would like. Thus, I'm documenting the steps I had to take, to help out anyone else who's attempting to go down this path.

For what it's worth, I don't plan on discussing the pros and cons of each of the DVCSes here. At the end of the day they each have comparable feature sets and functionality, and the choice of which DVCS a person uses will likely be a very personal one (or at least one dictated by someone in charge :-). Thus, I am not going to argue the benefits of Hg over any of the others, or even over SVN. I'm merely going to show how you can convert your existing SVN repos into Hg repos, as well as set up Hg to allow for easy SVN-like pushes/pulls on your server.

System Information

I should mention what I have been running, as well as what I will be running. I do this only because I know there's a myriad of ways to set up SVN and Hg, and unless you're doing what I'm doing, my notes won't help you much.

Traditionally, I ran SVN using WebDAV in Apache2.x. I wanted to continue to run Hg using Apache2.x (as this server has other needs for Apache2.x), but I no longer needed WebDAV for Hg. I'm also running Debian (with a mix of packages from stable, testing, and unstable).

Every tool that I mention in this guide can currently be found in Debian; their package names are:

Naturally, you can get these things up and running in other *nixes, but I'll leave that up to you to figure out if you decide to follow my guide.

Converting The Repos

Converting single/multiple SVN repos to Hg repos

This was perhaps the trickiest part of the process for me, because there's a plethora of tools for doing SVN to Hg conversions, but most of them don't seem to work well. I first tried yasvn2hg, but I couldn't get the damned script to even run. Next I tried Tailor, which promised to be the Swiss Army Knife of repo conversion utilities. However, I had hours of headache and no progress using it. Finally, I tried hgsvn, and it worked like a charm.

hgsvn is apt-gettable in Debian. However, hgsvn needs functionality from python-setuptools, and its package does not declare that dependency. This means that, unless you already have python-setuptools installed for something else, chances are you will see this error when you install hgsvn and try to run it:

$ hgimportsvn http://url.to.repo/repo
Traceback (most recent call last):
  File "/usr/bin/hgimportsvn", line 5, in <module>
    from pkg_resources import load_entry_point
ImportError: No module named pkg_resources

If you get this error, simply install the python-setuptools package (or equivalent) and try again.

Once hgsvn and its needed libraries are installed on your system, the basic method to convert a repository is as follows:

$ hgimportsvn http://url.to.repo/repo
...^^^Sets up the import
$ cd repo
...^^^Changes to the freshly created subdir
$ hgpullsvn
...^^^Pulls down all the changes from svn and creates an hg history
$ hg update (optional)

Once you've done these steps, your repo will have been converted to Hg. This works well for single repositories, but what if you have something more complicated?

Splitting a single repo into multiple repos

If you're like me, when you originally set up SVN you did so in the laziest way possible.

Setting up SVN repos is more work than it should be. It involves using commands that you normally never have to touch (svnadmin), setting up new entries for those repos in your http server's configuration files (if you're using Apache and WebDAV), and setting up user permissions to those repos. Thus, the lazy way to set them up is to make one central SVN repo under which you have multiple sub-repos. This has the advantage of making your repository very easy to maintain. However, it has a big disadvantage: unless you also set up path-based authorization, a user with write access to any sub-repo will have write access to the entire repo.

In Hg, on the other hand, setting up a new repository is much easier, and maintaining multiple repositories more manageable. So, if you're like me, you may be tempted to remedy past sins by splitting your single gargantuan SVN repo into smaller Hg repos. Thankfully, hgsvn makes this very easy.

Let's say that you have one core SVN repo called "main", which has the following sub-directories that you are treating as sub-repos:

main/
  projecta/
  projectb/
  projectc/

hgsvn can actually handle sub-directories of SVN repos and generate histories of just those sub-directories, effectively splitting the directories into repos of their own. It will even keep track of changes that only affect the individual sub-repo (meaning parent or neighbor changes don't get entered, unless they were otherwise combined in the original SVN).

A method for splitting the above could be:

$ hgimportsvn http://url.to.repo/main/projecta/
...^^^Start with "projecta/"
$ cd projecta
$ hgpullsvn
...^^^Pull the history for "projecta/"
$ hg update
...
$ cd ..
$ hgimportsvn http://url.to.repo/main/projectb/
...^^^Move on to "projectb/"
$ cd projectb
$ hgpullsvn
...^^^Pull the history for "projectb/"
$ hg update
...
etc.
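
With more than a couple of sub-directories, the repetition above is easy to script. What follows is only a sketch, assuming Python 3 and that hgimportsvn, hgpullsvn and hg are on your PATH. The names plan_split and run_plan are hypothetical helpers, not part of hgsvn, and the script relies on hgimportsvn creating a directory named after the last URL component, as it did in the steps above. By default it only prints what it would run.

```python
import subprocess


def plan_split(base_url, projects):
    """Return (working_dir, argv) pairs mirroring the manual steps above."""
    steps = []
    for name in projects:
        url = base_url.rstrip("/") + "/" + name + "/"
        steps.append((".", ["hgimportsvn", url]))  # creates ./<name>
        steps.append((name, ["hgpullsvn"]))        # pulls the SVN history
        steps.append((name, ["hg", "update"]))
    return steps


def run_plan(steps, dry_run=True):
    """Print each step; actually execute it only when dry_run is False."""
    for cwd, argv in steps:
        print("(in %s) $ %s" % (cwd, " ".join(argv)))
        if not dry_run:
            subprocess.check_call(argv, cwd=cwd)


if __name__ == "__main__":
    # Dry run only: prints the commands for the three example projects.
    run_plan(plan_split("http://url.to.repo/main",
                        ["projecta", "projectb", "projectc"]))
```

Once the printed commands look right, calling run_plan(..., dry_run=False) would execute them one after another and stop at the first command that fails.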

Cleaning up the SVN cruft

When you're done using the hgimportsvn and hgpullsvn tools, you will have repos in a strange half-SVN/half-Hg form. They will be legitimate Hg repos, but they will still have the .svn directories strewn throughout them, and have some .hgignore files telling Hg to ignore said .svn directories. So, if we're going to go 100% Hg, we may as well get rid of this stuff.

$ cd repo/  (whatever the path is to your hgsvn made repo)
$ find . -name .svn -type d -prune -exec rm -rf {} +
...^^^Get rid of the .svn/ directories
$ find . -name .hgignore -type f -exec rm -f {} +
...^^^Get rid of the .hgignore entries
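
If you would rather have something that reports what it deleted and never wanders into Hg's own .hg directory, the same cleanup can be sketched in Python 3. This is an illustration under my own assumptions, not something hgsvn ships; strip_svn_cruft is a hypothetical name.

```python
import os
import shutil


def strip_svn_cruft(repo_root):
    """Delete every .svn directory and .hgignore file under repo_root.

    Returns the list of removed paths. The .hg directory is never entered.
    """
    removed = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Never descend into Mercurial's own metadata.
        if ".hg" in dirnames:
            dirnames.remove(".hg")
        if ".svn" in dirnames:
            target = os.path.join(dirpath, ".svn")
            shutil.rmtree(target)
            dirnames.remove(".svn")  # already gone, so don't walk into it
            removed.append(target)
        if ".hgignore" in filenames:
            target = os.path.join(dirpath, ".hgignore")
            os.remove(target)
            removed.append(target)
    return removed
```

Calling strip_svn_cruft("repo/") and printing the returned list gives you a record of exactly what was removed.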

Configuring HTTP

Setting up the Hg http interface

As I said before, I used SVN via WebDAV and it was pretty painless once I got it going. Thus, I want Hg to behave exactly the same way when I do my pushes and pulls. Hg doesn't use WebDAV (at least, if it does, I didn't look deep enough into the documentation to figure out how to set it up), but it does come with a handy CGI script for giving you the same basic functionality.

Configuring hgwebdir.cgi

If you're only running one repo, there is a script called hgweb.cgi which is easy to configure and will handle your needs. However, since I run multiple repos, I decided to use another script called hgwebdir.cgi that serves up multiple Hg repos in one web-interface.

hgwebdir.cgi takes an external configuration file that defines the repos it will monitor. There are two ways you can configure this file for the repos.

The first is to use the [collections] directive which auto-magickally determines all of your Hg repos based upon some common root directory. For example, let's say that all your repos are under /var/repos:

/var/repos
      projecta/
      projectb/
      projectc/foo

You would then place the following in your hgwebdir.cgi configuration file if you wanted to use the [collections] directive:

[collections]
/var/repos = /var/repos

This configuration file would make your repos available online as "projecta", "projectb" and "projectc/foo".

However, if your repos are not under some common directory, or if there are other items that aren't repos alongside your repos, then you can use the [paths] directive to itemize each one:

[paths]
projecta = /home/fred/hg/projecta/
projectb = /var/repo/

Whichever you do, save the file (the name doesn't matter, I just used hgweb.config) and edit the line in hgwebdir.cgi to point to this newly created configuration file. For example:

def make_web_app():
    return hgwebdir("/etc/hgweb/hgweb.config")
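
Before pointing hgwebdir.cgi at the file, it can be worth checking that every [paths] entry really is an Hg repo, since a typo in a path is easy to miss. Here is a small sketch in Python 3 using only the standard library; check_paths is a hypothetical helper, and it assumes the simple "name = path" layout shown above.

```python
import configparser
import os


def check_paths(config_file):
    """Return {name: path} for [paths] entries that lack a .hg directory."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep repo names exactly as written
    parser.read(config_file)
    bad = {}
    if parser.has_section("paths"):
        for name, path in parser.items("paths"):
            if not os.path.isdir(os.path.join(path, ".hg")):
                bad[name] = path
    return bad
```

An empty result from check_paths("/etc/hgweb/hgweb.config") would mean every listed path contains a .hg directory; anything else is a name and path to double-check.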

Configuring your webserver

Technically speaking, you're already set. Just stick the hgwebdir.cgi file someplace where CGI scripts can be executed and point your browser at it. However, at this point you can't push repository changes via this web interface. Additionally, I kind of wanted the URLs to look cleaner.

Clean URLs

You may be fine handing out repository URLs like http://someurl.com/cgi-bin/hgwebdir.cgi?mf=b22511d1eb56;path=/, but I'm not. I want my repository to have URLs that are as clean as possible. So, I make sure mod_rewrite is enabled in my server (a2enmod rewrite, if you're running Apache2) and add the following to the Apache2 configuration entry for my hgwebdir.cgi:

        <IfModule mod_rewrite.c>
                RewriteEngine on
                RewriteRule ^/(.*) /hgwebdir.cgi/$1
        </IfModule>

User Authentication

Next up, I want users to be able to view and clone the repository anonymously, but to have to authenticate in order to push back to the server. Additionally, I want a central place for the htpasswd file (you could do this on a per-project basis, but I'll explain why you don't want to in a bit). So, I add the following to the Apache2 configuration entry for my hgwebdir.cgi:

        AuthUserFile /etc/hg/htpasswd
        AuthName "Dev Repo"
        AuthType Basic
        <Limit POST PUT>
                 Require valid-user
        </Limit>

The <Limit> segment is the magic here: it requires a valid user only for POST and PUT requests, so anyone can browse and clone anonymously, but you must be authenticated in order to push.

Note that in my example here we're using "AuthType Basic", which is probably not the best way to do it, since Basic authentication sends the password essentially in the clear unless the connection is encrypted. However, it is the simplest way to show for this example. I leave it to the reader to figure out how to use another AuthType (or make pushes go across SSL).
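
To make the weakness of Basic authentication concrete: the credentials travel in an Authorization header that is merely base64-encoded, not encrypted. The Python 3 sketch below builds such a header and then recovers the password from it, which is what anyone sniffing a plain HTTP push could do. The function names and the credentials are made up for the example.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def recover_credentials(header_value):
    """Undo the encoding: no key or secret is needed."""
    _scheme, token = header_value.split(" ", 1)
    decoded = base64.b64decode(token).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


header = basic_auth_header("tedh", "not-a-real-password")
print(header)
print(recover_credentials(header))  # ('tedh', 'not-a-real-password')
```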

Making the CGI the index

The final thing we need to do is make it so that the hgwebdir.cgi script is the index when the server attempts to serve up the page and to make sure the server can handle CGI.

         DirectoryIndex hgwebdir.cgi
         AddHandler cgi-script .cgi
         Options ExecCGI
         Order allow,deny
         Allow from all

Putting it all together

If we put it all together and assign it to a virtual host, we get an entry like the following (which, if you're using Apache2, can just be placed as a file in sites-available):

<VirtualHost XXX.XXX.XXX.XXX:80>
        ServerName hg.someplace.com
        DocumentRoot /var/hg/hgweb
        <IfModule mod_rewrite.c>
                RewriteEngine on
                RewriteRule ^/(.*) /hgwebdir.cgi/$1
        </IfModule>
        <Directory /var/hg/hgweb>
                DirectoryIndex hgwebdir.cgi
                AddHandler cgi-script .cgi
                Options ExecCGI
                Order allow,deny
                Allow from all
                AuthUserFile /etc/hg/htpasswd
                AuthName "Dev Repo"
                AuthType Basic
                <Limit POST PUT>
                        Require valid-user
                </Limit>
        </Directory>
</VirtualHost>

Configuring your hgrc file(s)

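The hgrc file is the general configuration file for all things Mercurial. There are always at least two possible hgrc files for every repository: the system-wide one (/etc/mercurial/hgrc) and the repository's own (.hg/hgrc inside the repository).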

Inside of these hgrcs, you can define a directive called [web] which controls the behavior of the web-interface used in hgweb.cgi and hgwebdir.cgi.

System-wide web hgrc settings

Hg's web-interface defaults to a style that I personally find ugly and confusing to use. I much prefer the "gitweb" style over the default Mercurial style. So I set the "style" parameter in the [web] section of the system-wide hgrc to "gitweb" to make it the default style.

Additionally, I want compressed archives to be made available, and I want to set a system-wide contact. Finally, if you're using the same setup I've detailed above, you aren't using SSL for your pushes, which means that the push over SSL requirement should be disabled.

[web]
style = gitweb
allow_archive = bz2 gz zip
contact = Myself, me@somewhere.com
push_ssl = false

Per repository hgrc settings and user authentication

For each repo, you can define a specific hgrc file that will override the system-wide settings from /etc/mercurial/hgrc.

Generally speaking, you want to at least define a description for the repository as well as who is allowed to push. Additionally, you can define new contact information if it differs from the system-wide setting.

[web]
description = An addressbook for keeping track of your "friends"
contact = Ted Haggard, tedh@ilikethemens.com
allow_push = tedh

Now, you can easily define per-repository htpasswd files; however, this can get unwieldy and is completely unnecessary. Instead, it makes more sense to define a global htpasswd file, and then define push rights per repository in the hgrc.

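So I could have a global htpasswd file that defines all of my users like this (the password fields are only illustrative; a real htpasswd file, as generated by the htpasswd tool, stores hashed passwords):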

tedh:HGand8176
fred:87JIkn7j1*9
joe:87/joiqKl91
jake:jasmn1%1tba

But then define the following project push rights via their hgrc's:

Project A

[web]
description = Project A
allow_push = tedh, joe

Project B

[web]
description = Project B
allow_push = jake, fred

Project C

[web]
description = Project C
allow_push = tedh, jake, joe
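
The split of responsibilities here is that Apache decides who you are (via the global htpasswd file) and each repo's hgrc decides whether that user may push. As a simplified model of the second half, the Python 3 sketch below reads an hgrc and answers "may this user push?". It is not Mercurial's actual code; allowed_pushers and may_push are hypothetical names, and it assumes allow_push is a comma- or whitespace-separated list where "*" means anyone.

```python
import configparser
import re


def allowed_pushers(hgrc_text):
    """Return the user names listed in [web] allow_push (possibly empty)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(hgrc_text)
    raw = parser.get("web", "allow_push", fallback="")
    return [name for name in re.split(r"[\s,]+", raw) if name]


def may_push(hgrc_text, user):
    """True if the authenticated user is listed, or the list contains '*'."""
    allowed = allowed_pushers(hgrc_text)
    return "*" in allowed or user in allowed


PROJECT_A = "[web]\ndescription = Project A\nallow_push = tedh, joe\n"
print(may_push(PROJECT_A, "tedh"))  # True
print(may_push(PROJECT_A, "jake"))  # False
```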

Conclusion and Issues

When it's all said and done, you should have a working Hg repository server and should be able to pull/push from/to it.

However, there were some small issues that I ran into that I should note simply because they seemed to be a bit tricky.

abort: consistency error adding group!

The first problem I ran into was when I tried to get a friend of mine online to try out the new repository for our IRC bots code. When he tried to clone the repo, he got the following error:

[17:02] < schultmc> | $ hg clone http://dev.samhart.net/bots/
[17:02] < schultmc> | destination directory: bots
[17:02] < schultmc> | requesting all changes
[17:02] < schultmc> | adding changesets
[17:02] < schultmc> | adding manifests
[17:02] < schultmc> | adding file changes
[17:02] < schultmc> | abort: consistency error adding group!
[17:02] < schultmc> | transaction abort!
[17:02] < schultmc> | rollback completed

Additionally, when I tried to clone it, I would either get the same error he did, or get the following:

$ hg -v clone http://dev.samhart.net/bots
destination directory: bots
requesting all changes
adding changesets
adding manifests
adding file changes
abort: premature EOF reading chunk (got 6822 bytes, expected 34384)
transaction abort!
rollback completed

The strange thing was, other repos worked fine, and an "hg verify" on the server revealed them all to be in working order.

As it turns out, both of these problems have the same root cause: errors on the server. In my case the system was running out of resources, so adding a bit more swap solved the problem. However, it could also be things like permission problems or other misc. Apache errors.

At any rate, if you get errors that look like the above, chances are they are server errors and you should look very closely at what's going on during each attempted transaction.

Stacktrace on push with wrong username/password

This is just an ugly stacktrace, but it doesn't seem to cause any problems. If you try to push and use the wrong username/password, you will get a stacktrace that looks a lot like the following:

** unknown exception encountered, details follow
** report bug details to http://www.selenic.com/mercurial/bts
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 0.9.3)
Traceback (most recent call last):
  File "/usr/bin/hg", line 12, in ?
    commands.run()
  File "/var/lib/python-support/python2.4/mercurial/commands.py", line 3000, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/var/lib/python-support/python2.4/mercurial/commands.py", line 3223, in dispatch
    return d()
  File "/var/lib/python-support/python2.4/mercurial/commands.py", line 3182, in <lambda>
    d = lambda: func(u, repo, *args, **cmdoptions)
  File "/var/lib/python-support/python2.4/mercurial/commands.py", line 1971, in push
    r = repo.push(other, opts['force'], revs=revs)
  File "/var/lib/python-support/python2.4/hgext/mq.py", line 2025, in push
    return super(mqrepo, self).push(remote, force, revs)
  File "/var/lib/python-support/python2.4/mercurial/localrepo.py", line 1360, in push
    return self.push_unbundle(remote, force, revs)
  File "/var/lib/python-support/python2.4/mercurial/localrepo.py", line 1438, in push_unbundle
    return remote.unbundle(cg, remote_heads, 'push')
  File "/var/lib/python-support/python2.4/mercurial/httprepo.py", line 352, in unbundle
    heads=' '.join(map(hex, heads)))
  File "/var/lib/python-support/python2.4/mercurial/httprepo.py", line 235, in do_cmd
    resp = urllib2.urlopen(urllib2.Request(cu, data, headers))
  File "/usr/lib/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.4/urllib2.py", line 364, in open
    response = meth(req, response)
  File "/usr/lib/python2.4/urllib2.py", line 471, in http_response
    response = self.parent.error(
  File "/usr/lib/python2.4/urllib2.py", line 396, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.4/urllib2.py", line 741, in http_error_401
    host, req, headers)
  File "/usr/lib/python2.4/urllib2.py", line 720, in http_error_auth_reqed
    return self.retry_http_basic_auth(host, req, realm)
  File "/usr/lib/python2.4/urllib2.py", line 730, in retry_http_basic_auth
    return self.parent.open(req)
  File "/usr/lib/python2.4/urllib2.py", line 364, in open
    response = meth(req, response)
  File "/usr/lib/python2.4/urllib2.py", line 471, in http_response
    response = self.parent.error(
  File "/usr/lib/python2.4/urllib2.py", line 396, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.4/urllib2.py", line 916, in http_error_401
    host, req, headers)
  File "/usr/lib/python2.4/urllib2.py", line 807, in http_error_auth_reqed
    raise ValueError("AbstractDigestAuthHandler doesn't know "
ValueError: AbstractDigestAuthHandler doesn't know about Basic

I've been searching the various bug databases involved (Debian's and Mercurial's) but haven't found this particular problem yet. Will likely be filing a bug in a bit.
