Aug 06 2013

Reblogged from the Lanedo GmbH blog:

Documentation Tools

Would you want to invest hours or days into automake logic without a use case?

For two of the last software releases I did, I was facing this question. Let me give a bit of background. Recently the documentation generation for Beast and Rapicorn fully switched over to Doxygen. This has brought a number of advantages, such as graphs for the C++ inheritance of classes, build tests of documentation example code and integration with the Python documentation. What was left to improve, however, was the build process and the logic involved, both of which needed simplification.

Generating polished documentation is time consuming

Maintaining the documentation builds has become increasingly complex. One of the things adding to the complexity is the growing number of dependencies, as external tools are required for a number of features: Doxygen (and thus Qt) is required to build the main docs, dot is required for all sorts of graph generation, Python scripts are used to auto-extract some documentation bits, rsync is used for incremental updates of the documentation, Git is used to extract interesting meta information like the documentation build version number or the commit logs, etc.

More complexity for tarballs?

For the next release, I faced the task of looking into making the documentation generation rules work for tarballs, outside of the Git repository. That means building in an environment significantly different from the usual development setup and toolchain (of which Git has become an important part). At the very least, this was required:

  • Creating autoconf rules to check for Doxygen and all its dependencies.
  • Requiring users to have a working Qt installation in order to build Rapicorn.
  • Deriving a documentation build version id without access to Git.
  • Getting the build dependencies right so we auto-build when Git is around but don’t break when Git’s not around.

All of this just for the little gain of enabling the normal documentation re-generation for someone wanting to start development off a tarball release.
Development based on tarballs? Is this a modern use case?
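
The last two items in the list above can be sketched roughly as follows. This is only an illustration, not the logic used in Beast or Rapicorn: the function name `doc_version_id` and the stamp file `doc-version.txt` are hypothetical, and the sketch assumes a release script would write that file into the tarball.

```python
import subprocess
from pathlib import Path


def doc_version_id(srcdir, fallback_file="doc-version.txt"):
    """Return a version id for the docs, preferring Git metadata.

    `fallback_file` is a hypothetical stamp file that a release
    script would ship inside the tarball.
    """
    src = Path(srcdir)
    if (src / ".git").exists():
        try:
            out = subprocess.run(
                ["git", "describe", "--always", "--dirty"],
                cwd=src, capture_output=True, text=True, check=True,
            )
            return out.stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            pass  # Git missing or failing: fall back to the shipped id
    stamp = src / fallback_file
    if stamp.is_file():
        return stamp.read_text().strip()
    return "unknown"
```

Even this small fallback has to be wired into the build dependencies, which hints at how much conditional logic the full set of requirements would need.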

Development happens in Git

During this year’s LinuxTag, I took the chance to enter discussions and get feedback on development habits in 2013. Development based on tarballs certainly was the norm when I started in Free Software & Open Source; that was 1996. It’s totally not the case these days. A large number of projects have moved to Git or the like. Rapicorn and Beast were moved to Git several years ago, we adopted the commit style of the Linux kernel, and a GNU-style ChangeLog plus commit hash ids are auto-generated from Git for the tarballs.
Utilizing the meta information for a project living in Git comes naturally as time passes and projects get more familiar with Git. Examples are signed tags, scripts around branch and merge conventions, history grepping or symbolic version id generation. Git also significantly improves spin-off developments, which is why development of Git-hosted projects generally happens in Git branches or Git clones these days. Sites like GitHub encourage forking and pulling; going back to the inconveniences of tarball-based development, stripped of any history, would be a giant leap backwards. In fact, these days tarballs serve as little more than a transport container for a specific snapshot of a Git repository.

Shipping pre-built Documentation

Taking a step back, it’d seem easier to avoid the hassle of adapting all the documentation build logic to work both ways, with and without Git, by simply including a copy of the readily built result in the tarball. Like everything, there’s a downside here as well of course: tarball size will increase significantly. Just how significantly? The actual size can make or break the deal, e.g. if it changed by orders of magnitude. Let’s take a look:

  • 6.0M – beast-0.8.0.tar.bz2
  • 23.0M – beast-full-docs.tar.bz2

Uhhh, that’s a lot. All the documentation for the next release totals almost four times the size of the last tarball. That’s a bit excessive, can we do better?

It turns out that a large portion of the space in a full Doxygen HTML build is actually used up by images. Not the biggest chunk, but a large one nevertheless. For the above example, we’re looking at:

  • 23M – du -hc full-docs/*png
  • 73M – du -hc full-docs/

So, 23 out of 73 MB for the images, that’s 32%. Doxygen doesn’t make it too hard to build without images; it just needs two configuration settings, HAVE_DOT = NO and CLASS_DIAGRAMS = NO. Rebuilding the docs without any images also removes a number of image references, so we end up with:

  • 42M – slim-docs/
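
One way to apply the two settings without touching the regular configuration is a small override file that includes the main one. The sketch below is an assumption about how this could be wired up, not the setup used for Beast: the file names `Doxyfile` and `Doxyfile.slim` and the helper `write_slim_doxyfile` are made up for illustration. It relies on Doxygen's `@INCLUDE` tag and on later assignments overriding earlier ones.

```python
from pathlib import Path


def write_slim_doxyfile(base="Doxyfile", out="Doxyfile.slim"):
    """Write a config that includes the base Doxyfile and disables graphs.

    Both file names are hypothetical; adjust them to the project layout.
    """
    lines = [
        f"@INCLUDE = {base}",   # pull in all regular settings first
        "HAVE_DOT = NO",        # no dot-generated graphs
        "CLASS_DIAGRAMS = NO",  # no built-in class diagrams either
    ]
    Path(out).write_text("\n".join(lines) + "\n")
    return out
```

Running `doxygen Doxyfile.slim` from the directory holding the base file would then produce the image-free build.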

That’s a 42% reduction in documentation size. Actually that’s just plain text documentation now, without any pre-compressed PNG images. That means bzip2 could do a decent job at it, let’s give it a try:

  • 2.4M – beast-slim-docs.tar.bz2

Wow, that went better than expected: the compressed slim docs amount to just about 40% of the source code tarball size at this point. That is definitely acceptable. Here are the numbers for the release candidates in direct comparison, with and without pre-built documentation:

  • 6.1M – beast-no-docs-0.8.1-rc1.tar.bz2
  • 8.6M – beast-full-docs-0.8.1-rc1.tar.bz2
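
For reference, the percentages quoted in this section follow directly from the listed sizes. The short check below only uses the numbers given above; the helper `percent` is just for illustration.

```python
def percent(part, whole):
    """Return part/whole as a percentage, rounded to the nearest integer."""
    return round(100.0 * part / whole)


png_share = percent(23, 73)            # PNG images within the full docs: 32
reduction = percent(73 - 42, 73)       # full docs vs. slim docs: 42
slim_vs_source = percent(2.4, 6.0)     # compressed slim docs vs. source tarball: 40
rc_increase = percent(8.6 - 6.1, 6.1)  # rc1 tarball growth with docs: 41
full_factor = round(23.0 / 6.0, 1)     # full docs tarball vs. source tarball: 3.8

print(png_share, reduction, slim_vs_source, rc_increase, full_factor)
```

The release candidate with documentation is thus about 41% larger than the one without.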

Disable Documentation generation rules in tarballs

Now that we’ve established that shipping documentation without graphs results in an acceptable tarball size increase, it’s easy to make the call to include full documentation with tarball releases. As a nice side effect, auto-generation of the documentation in tarballs can be disabled (not having the Git tree and other tools available, it’d be prone to fail anyway). The only thing to watch out for is the srcdir != builddir case with automake: in Git trees the documentation is built inside builddir, while in tarballs it’s shipped and available from within srcdir.
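
The srcdir versus builddir distinction boils down to a lookup order. The following is a minimal sketch of that idea outside of automake; the helper `locate_docs`, the `docs/html` subdirectory and the `index.html` marker are assumptions for illustration, not the actual project layout.

```python
from pathlib import Path


def locate_docs(srcdir, builddir, subdir="docs/html"):
    """Return the directory holding the HTML docs, or None if absent.

    builddir is checked first (docs freshly built from a Git tree),
    then srcdir (docs shipped pre-built in a tarball).
    """
    for base in (builddir, srcdir):
        candidate = Path(base) / subdir
        if (candidate / "index.html").is_file():
            return candidate
    return None
```

An install rule would need the equivalent check so that it picks up the shipped copy when building from a tarball in a separate build directory.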

Pros and Cons for shipping documentation

  • Con: Tarball sizes increase, but the size difference seems acceptable; practical tests show less than a 50% increase in tarball sizes for documentation excluding generated graphics.
  • Con: Tarball source changes cannot be reflected in docs. This mostly affects packagers; as a remedy, it’d be nice to receive substantial patches upstream.
  • Pro: The build logic is significantly simplified, allowing a hard dependency on Git and skipping complex conditionals for tool availability.
  • Pro: Build time and complexity from tarballs is reduced. A nice side effect, considering the variety of documentation tools out there, sgml-tools, doxygen, gtk-doc, etc.

For me the pros in this case clearly outweigh the cons. I’m happy to hear about pros and cons I might have missed.

Prior Art?

Looking around the web for cases of other projects doing this didn’t turn up too many examples. Most projects probably don’t yet trade documentation generation rules for pre-generated documentation in tarballs.

If you know projects that turned to pre-generated documentation, please let me know about them.

I’m also very interested in end-user and packager opinions on this. Also, do people know about other materials that projects include pre-built in tarballs, without the means to regenerate everything from just the tarballs?


  10 Responses to “Should We Include Documentation Builds In Tarballs?”

  1. IMHO, include them. The overall package size is quite ridiculous relative to Microsoft Office.

  2. How about providing an extra docs tar.bz2?
    I really think that most of the value of Doxygen is in the pictures (graphs generated by dot).
    Like the files included, or the hierarchy… I think the docs without the pictures are not as good.

    An interesting approach would be to somehow patch Doxygen to generate (and use) SVGs instead of PNGs. These should compress much better… Graphviz already supports SVG output.

    • Hey monitor, thanks for pointing out how important images are for you. Note that those are still provided in the online version of the docs, e.g. here:
      You make a very interesting point about SVG though, definitely worth investigating!

      • Hello Tim!

        One more note: the pictures are important, but there’s one other nice feature of Doxygen I like: knowing who calls a function.
        With REFERENCED_BY_RELATION you can find out, e.g., how to use a specific function (to get a good example, or to know who will be influenced by a change in this function).

        Furthermore, when trying to understand foreign code, it’s also nice to have SOURCE_BROWSER – it’s like annotated, hyperlinked, clickable sources (you can maybe get something similar in IDEs like Eclipse, with CTRL+click). But maybe the version on the web could have this too.

        (This needs of course changes in the doxygen-html target of the docs/Makefile.doxygen file.)

        • Hey again.

          Thanks for the tip, activating REFERENCED_BY_RELATION is indeed useful and the size overhead is negligible. About SOURCE_BROWSER, in my projects it is enabled for the header files at the very least.

  3. Oh, even Doxygen already supports the SVG output option.
    See the DOT_IMAGE_FORMAT option. It can be set to svg.

    Furthermore, some people report that this can lead to a quite dramatic decrease in documentation size:
    “resulting rpm is something like 200 MB vs 15 MB”.
    Here’s another project using the same svg option (DOT_IMAGE_FORMAT = svg): “Reduce documentation size by more efficient image format”.

    • Great suggestion! For my Rapicorn tree, enabling PNG images results in a tarball size increase of some 430%, while enabling SVG images results in only ca. 12%. That practically allows shipping images in the tarball docs. The Doxygen/dot runtime for SVG images is also only a fraction of the time needed for the PNG builds.

  4. Sane distributions like Debian will build them from source anyway, so I would suggest not including them at all.

  5. Hi Tim, you seem to imply that doxygen requires Qt (or maybe I misunderstood you). But AFAIK, this is not the case: only the doxygen UI wizard requires Qt. This is what I understand from the doxygen website, and also seems confirmed by ldd:

    mardy@devel:~$ ldd /usr/bin/doxygen
     => (0x00007fff3d578000)
     => /lib/x86_64-linux-gnu/ (0x00007fbd396fd000)
     => /usr/lib/x86_64-linux-gnu/ (0x00007fbd393f9000)
     => /lib/x86_64-linux-gnu/ (0x00007fbd390f3000)
     => /lib/x86_64-linux-gnu/ (0x00007fbd38edd000)
     => /lib/x86_64-linux-gnu/ (0x00007fbd38b15000)
    /lib64/ (0x00007fbd39944000)

    • Hi Alberto,

      you are right, thanks for pointing out the specifics of the Qt dependency. I just looked at the sources again, the QString/QStringList/etc uses in the Doxygen core are indeed now fulfilled by inclusion of corresponding Qt source copies.
