For two of the last software releases I did, I was facing this question. Let me give a bit of background. Recently the documentation generation for Beast and Rapicorn fully switched over to Doxygen. This has brought a number of advantages, such as graphs for the C++ inheritance of classes, build tests of documentation example code and integration with the Python documentation. What was left for improvement was a simplification of the build process and logic involved however.
Maintaining the documentation builds has become increasingly complex. One of the things adding to the complexity are increased dependencies, external tools are required for a number of features: E.g. Doxygen (and thus Qt) is required to build the main docs, dot is required for all sorts of graph generation, python scripts are used to auto-extract some documentation bits, rsync is used for incremental updates of the documentation, Git is used to extract interesting meta information like the documentation build version number or the commit logs, etc.
For the next release, I faced the task of looking into making the documentation generation rules work for tarballs, outside of the Git repository. That means building in an environment significantly different from the usual development setup and toolchain (of which Git has become an important part). At the very least, this was required:
Creating autoconf rules to check for Doxygen and all its dependencies.
Require users to have a working Qt installation in order to build Rapicorn.
Deriving a documentation build version id without access to Git.
Getting the build dependencies right so we auto-build when Git is around but don’t break when Git’s not around.
All of this just for the little gain of enabling the normal documentation re-generation for someone wanting to start development off a tarball release. Development based on tarballs? Is this a modern use case?
During this year’s LinuxTag, I’ve taken the chance to enter discussions and get feedback on development habits in 2013. Development based on tarballs certainly was the norm when I started in Free Software & Open Source, that was 1996. It’s totally not the case these days. A large number of projects moved to Git or the likes. Rapicorn and Beast have been moved to Git several years ago, we adopted the commit style of the Linux kernel and a GNU-style ChangeLog plus commit hash ids is auto-generated from Git for the tarballs. Utilizing the meta information for a project living in Git comes naturally as time passes and projects get more familiar with Git. Examples are signed tags, scripts around branch-/merge-conventions, history greping or symbolic version id generation. Git also significantly improves spin-off developments which is why development of Git hosted projects generally happens in Git branches or Git clones these days. Sites like github encourage forking and pulling, going back to the inconveniences of tarball based development baring any history would be a giant leap backwards. In fact, these days tarballs serve as little more than a transport container for a specific snapshot of a Git repository.
Taking a step back, it’d seem easier to avoid the hassle of adapting all the documentation build logic to work both ways, with and without Git, by simply including a copy of the readily built result into the tarball. Like everything, there’s a downside here as well of course, tarball size will increase significantly. Just how significantly? The actual size can make or break the deal, e.g. if it changed by orders of magnitude. Let’s take a look:
6.0M - beast-0.8.0.tar.bz2
23.0M - beast-full-docs.tar-bz2
Uhhh, that’s a lot. All the documentation for the next release totals around almost four times that of the last tarball size. That’s a bit excessive, can we do better?
It turns out that a large portion of the space in a full Doxygen HTML build is actually used up by images. Not the biggest chunk but a large one nevertheless, for the above example, we’re looking at:
23M - du -hc full-docs/*png
73M - du -hc full-docs/
So, 23 out of 73 MB for the images, that’s 32%. Doxygen doesn’t make it too hard to build without images, it just needs two configuration settings
HAVE_DOT = NO and
CLASS_DIAGRAMS = NO. Rebuilding the docs without any images also removes a number of image references, so we end up with:
42M - slim-docs/
That’s a 42% reduction in documentation size. Actually that’s just plain text documentation now, without any pre-compressed PNG images. That means bzip2 could do a decent job at it, let’s give it a try:
2.4M - beast-slim-docs.tar-bz2
Wow, that went better than expected, we’re just talking about 40% of the source code tarball at this point. Definitely acceptable, here’re the numbers for the release candidates in direct comparison, with and without pre-built documentation:
6.1M - beast-no-docs-0.8.1-rc1.tar.bz2
8.6M - beast-full-docs-0.8.1-rc1.tar.bz2
Now that we’ve established that shipping documentation without graphs results in an acceptable tarball size increase, it’s easy to make the call to include full documentations with tarball releases.
As a nice side effect, auto-generation of the documentation in tarballs can be disabled (not having the Git tree and other tools available, it’d be prone to fail anway). The only thing to watch out for is a
srcdir != builddir case with automake, as in Git trees documentation is build inside builddir, while it’s shipped and available from within srcdir in tarballs.
Con: Tarball sizes increase, but the size difference seems accaptable, practical tests show less than 50% increase in tarball sizes for documentation excluding generated graphics.
Con: Tarball source changes cannot be reflected in docs. This mostly affects packagers, it’d be nice to receive substantial patches in upstream as a remedy.
Pro: The build logic is significantly simplified, allowing a hard dependency on Git and skipping complex conditionals for tool availability.
Pro: Build time and complexity from tarballs is reduced. A nice side effect, considering the variety of documentation tools out there, sgml-tools, doxygen, gtk-doc, etc.
For me the pros in this case clearly outweigh the cons. I’m happy to hear about pros and cons I might have missed.
Looking around the web for cases of other projects doing this didn’t turn up too many examples. There’s some probability that most projects don’t yet trade documentation generation rules for pre-generated documentation in tarballs.
I’m also very interested in end-user and packager’s opinions on this. Also, do people know about other materials that projects include pre-built in tarballs? And without the means to regenerate everything from just the tarballs?