Selective repository import via "git filter-branch"

This article is about selectively importing parts of a repository while keeping the version history intact. It is not particularly about Java technology, but I'm tossing it in here anyway since I have blogged here before about my adventures with version control software. What has changed for me since 2013 is that the project has expanded and grown together with some other projects, all of which use Git and GitHub. It's no longer a one-man project. The core parts of my Java EE code are still managed by Fossil, but other parts are being migrated to GitHub in order to make life a bit easier for the team as a whole.

The first part of the migration is to export the entire Fossil repository to a file in Git's "fast-import" format. Just cd to your Fossil working directory and do:

    fossil export --git > ~/repo.data

The second part of the migration concerns importing this file. Nothing about this is Fossil-specific; it works the same way if the original export was made from Subversion, Mercurial, CVS, or whatever, as long as the file is in the fast-import format. First cd to wherever you want to store your new Git repo, then do:

    git init new-repo
    cd new-repo
    git fast-import < ~/repo.data
    git checkout trunk

The last command checks out the branch you want filtered. It is named "trunk" in this example since in my case the original repository lived in Subversion before it became a Fossil repository.
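
If you are unsure what the branches in the export are called, the file can be scanned for the refs it mentions. The following is only a sketch: list_export_refs is a made-up helper name, and it assumes that no file content in the stream happens to contain a line that looks like a commit command.

```shell
# Hypothetical helper: print each distinct ref named by a "commit" command
# in a fast-import stream read from stdin.
list_export_refs() {
    # LC_ALL=C makes awk treat the stream as plain bytes, since the stream
    # also carries file contents in arbitrary encodings.
    LC_ALL=C awk 'NF == 2 && $1 == "commit" && $2 ~ /^refs\// { print $2 }' | sort -u
}
```

It could be used like this, before or after the import:

    list_export_refs < ~/repo.data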

The third part concerns filtering the checked-out branch. This is the selective part of the procedure. Create a script "filter.pl" that, when given a pathname as input, prints it out if it should be removed from the branch (i.e. filtered out), and prints nothing otherwise:

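    #!/usr/bin/perl
    # Run with "perl -n": $_ holds one pathname per input line.
    # Print the pathname if it should be REMOVED from the history.
    if(m{(^|/)naughty-secrets\.txt$}) { print; }
    elsif(!m{(^|/)(subproject1|subproject2|subproject3)/}) { print; }
    1;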

The above example script removes any file named "naughty-secrets.txt" in any directory, and everything that is not stored under a directory named "subproject1", "subproject2", or "subproject3".
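
Before rewriting any history, it may be worth trying the filter on the current file list to see what it would do. Here is a rough sketch of such a dry run. The helper name preview_filter is hypothetical, and the helper simply assumes that the filter command prints the pathnames to remove, one per line, as described above.

```shell
# Hypothetical helper: label each pathname on stdin as "remove" or "keep",
# according to the filter command given as arguments.
preview_filter() {
    paths=$(cat)
    # The filter prints the pathnames it wants removed; "|| true" tolerates
    # filters that exit non-zero when they print nothing.
    removed=$(printf '%s\n' "$paths" | "$@" || true)
    printf '%s\n' "$paths" | while IFS= read -r p; do
        [ -n "$p" ] || continue
        if printf '%s\n' "$removed" | grep -Fxq -- "$p"; then
            printf 'remove  %s\n' "$p"
        else
            printf 'keep    %s\n' "$p"
        fi
    done
}
```

With the script from this article, a dry run inside the new repo might look like this:

    git ls-files --cached | preview_filter perl -n /home/user/filter.pl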

Now run the git filter-branch command:

    git filter-branch --index-filter \
        "git ls-files --cached | \
            perl -n /home/user/filter.pl | \
            xargs -r git rm --cached --ignore-unmatch -- >/dev/null" \
        --prune-empty -- --all

This may take a while. The filtering script is called once for each commit being rewritten. To speed it up a bit you can mount a tmpfs file system (or some other ramdisk implementation) and move the repository there.

When the filtering is finished you should inspect your tree to check that it contains all the files that should be in the import, and no others. Run git log on a few sample files to verify that their history is visible. Note, however, that some history may be missing where files have been moved around or renamed, so that parts of their history were rejected by the filter.
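
One way to check for "no others" is to feed every pathname that still appears in the rewritten history back through the filter, which should then print nothing. The sketch below does that. leftover_paths is a hypothetical helper name, and the filter command is whatever you pass as arguments.

```shell
# Hypothetical helper: run the filter command given as arguments over the
# pathnames on stdin and fail if it would still remove any of them.
leftover_paths() {
    # Drop blank lines and duplicates before filtering.
    leftovers=$(grep -v '^$' | sort -u | "$@" || true)
    if [ -n "$leftovers" ]; then
        printf '%s\n' "$leftovers"
        return 1
    fi
}
```

For example, the following lists any pathname in the branch histories that the filter would still reject. It uses --branches rather than --all so that the backup refs which filter-branch keeps under refs/original/ are left out:

    git log --branches --name-only --pretty=format: | leftover_paths perl -n /home/user/filter.pl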

When satisfied with the filtering part, you can now proceed with the optional step of setting the author and committer attributes:

    git filter-branch -f --env-filter "
         GIT_AUTHOR_NAME='Firstname Lastname'
         GIT_AUTHOR_EMAIL='github@contact.example.com'
         GIT_COMMITTER_NAME='Firstname Lastname'
         GIT_COMMITTER_EMAIL='github@contact.example.com'
      " -- --all

This is much faster than the file filtering part, and especially so if the filtered repo is a lot smaller than the original.

Finally, you can now merge the selective import into your target repo. This is also optional of course. Here is an example:

    cd ../target-repo
    git pull ../new-repo
    git mv trunk/somedir target-root
    git commit

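Depending on your particular circumstances, there may be more moving and renaming needed to fit the imported stuff correctly into the target repository. Note that Git 2.9 and later refuse to merge two histories that share no common ancestor unless you add --allow-unrelated-histories to the pull command.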