Aug 06 2013

Reblogged from the Lanedo GmbH blog:

Documentation Tools

Would you want to invest hours or days into automake logic without a use case?

For two of the last software releases I did, I was facing this question. Let me give a bit of background. Recently, the documentation generation for Beast and Rapicorn fully switched over to Doxygen. This has brought a number of advantages, such as graphs of the C++ class inheritance, build tests of documentation example code and integration with the Python documentation. What still left room for improvement, however, was simplifying the build process and the logic involved.

Generating polished documentation is time consuming

Maintaining the documentation builds has become increasingly complex. One of the things adding to the complexity is the growing number of dependencies; external tools are required for a number of features. E.g. Doxygen (and thus Qt) is required to build the main docs, dot is required for all sorts of graph generation, Python scripts are used to auto-extract some documentation bits, rsync is used for incremental updates of the documentation, Git is used to extract interesting meta information like the documentation build version number or the commit logs, etc.

More complexity for tarballs?

For the next release, I faced the task of looking into making the documentation generation rules work for tarballs, outside of the Git repository. That means building in an environment significantly different from the usual development setup and toolchain (of which Git has become an important part). At the very least, this was required:

  • Creating autoconf rules to check for Doxygen and all its dependencies.
  • Requiring users to have a working Qt installation in order to build Rapicorn.
  • Deriving a documentation build version id without access to Git.
  • Getting the build dependencies right, so docs auto-build when Git is around but don’t break when it’s not.

All of this just for the little gain of enabling normal documentation re-generation for someone who wants to start development off a tarball release.

Development based on tarballs? Is this a modern use case?

Development happens in Git

During this year’s LinuxTag, I took the chance to enter discussions and get feedback on development habits in 2013. Development based on tarballs certainly was the norm when I started in Free Software & Open Source, back in 1996. That is totally not the case these days. A large number of projects have moved to Git or the likes. Rapicorn and Beast were moved to Git several years ago; we adopted the commit style of the Linux kernel, and a GNU-style ChangeLog plus commit hash ids is auto-generated from Git for the tarballs.

Utilizing the meta information of a project living in Git comes naturally as time passes and projects get more familiar with Git. Examples are signed tags, scripts around branch/merge conventions, history grepping or symbolic version id generation. Git also significantly improves spin-off developments, which is why development of Git-hosted projects generally happens in Git branches or Git clones these days. Sites like GitHub encourage forking and pulling; going back to the inconveniences of tarball-based development, bare of any history, would be a giant leap backwards. In fact, these days tarballs serve as little more than a transport container for a specific snapshot of a Git repository.

Shipping pre-built Documentation

Taking a step back, it’d seem easier to avoid the hassle of adapting all the documentation build logic to work both ways, with and without Git, by simply including a copy of the readily built result in the tarball. Like everything, this has a downside as well, of course: tarball size will increase significantly. Just how significantly? The actual size can make or break the deal, e.g. if it changed by orders of magnitude. Let’s take a look:

  • 6.0M – beast-0.8.0.tar.bz2
  • 23.0M – beast-full-docs.tar.bz2

Uhhh, that’s a lot. The full documentation for the next release totals almost four times the size of the last release tarball. That’s a bit excessive; can we do better?

It turns out that a large portion of the space in a full Doxygen HTML build is actually used up by images. Not the biggest chunk, but a large one nevertheless; for the above example, we’re looking at:

  • 23M – du -hc full-docs/*png
  • 73M – du -hc full-docs/

So, 23 out of 73 MB for the images, that’s 32%. Doxygen doesn’t make it too hard to build without images; it just needs the two configuration settings HAVE_DOT = NO and CLASS_DIAGRAMS = NO. Rebuilding the docs without any images also removes a number of image references, so we end up with:

  • 42M – slim-docs/
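For reference, the two configuration settings mentioned above go straight into the Doxyfile (both are real Doxygen options; the rest of the configuration is left untouched):

```
# Doxyfile: disable dot-generated graphs (inheritance, collaboration, call graphs)
HAVE_DOT       = NO
# also disable the built-in textual class diagrams
CLASS_DIAGRAMS = NO
```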

That’s a 42% reduction in documentation size. Actually that’s just plain text documentation now, without any pre-compressed PNG images. That means bzip2 could do a decent job at it, let’s give it a try:

  • 2.4M – beast-slim-docs.tar.bz2

Wow, that went better than expected; we’re talking about just 40% of the source code tarball size at this point. Definitely acceptable. Here are the numbers for the release candidates in direct comparison, with and without pre-built documentation:

  • 6.1M – beast-no-docs-0.8.1-rc1.tar.bz2
  • 8.6M – beast-full-docs-0.8.1-rc1.tar.bz2

Disable Documentation generation rules in tarballs

Now that we’ve established that shipping documentation without graphs results in an acceptable tarball size increase, it’s easy to make the call to include full documentation with tarball releases. As a nice side effect, auto-generation of the documentation in tarballs can be disabled (without the Git tree and other tools available, it’d be prone to fail anyway). The only thing to watch out for is the srcdir != builddir case with automake: in Git trees the documentation is built inside builddir, while in tarballs it’s shipped and available from within srcdir.
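The srcdir != builddir handling can be sketched in Makefile.am along these lines. This is only a sketch: the WITH_GIT conditional, the doc/html path and the variable name are assumptions of mine (the conditional would come from an AM_CONDITIONAL in configure.ac), not the actual Rapicorn rules:

```
# sketch: generate docs in Git trees, use the pre-built copy in tarballs
if WITH_GIT
doc_html_dir = $(builddir)/doc/html    # freshly generated via Doxygen
else
doc_html_dir = $(srcdir)/doc/html      # pre-built copy shipped in the tarball
endif
EXTRA_DIST += doc/html                 # include the generated docs in 'make dist'
```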

Pros and Cons for shipping documentation

  • Con: Tarball sizes increase, but the size difference seems acceptable; practical tests show less than a 50% increase in tarball size for documentation excluding generated graphics.
  • Con: Tarball source changes cannot be reflected in the docs. This mostly affects packagers; as a remedy, it’d be nice if substantial patches were submitted upstream.
  • Pro: The build logic is significantly simplified, allowing a hard dependency on Git and skipping complex conditionals for tool availability.
  • Pro: Build time and complexity for tarballs are reduced. A nice side effect, considering the variety of documentation tools out there: sgml-tools, doxygen, gtk-doc, etc.

For me the pros in this case clearly outweigh the cons. I’m happy to hear about pros and cons I might have missed.

Prior Art?

Looking around the web for cases of other projects doing this didn’t turn up too many examples. There’s some probability that most projects don’t yet trade documentation generation rules for pre-generated documentation in tarballs.

If you know projects that turned to pre-generated documentation, please let me know about them.

I’m also very interested in end-user and packagers’ opinions on this. Also, do people know about other materials that projects include pre-built in tarballs? And without the means to regenerate everything from just the tarballs?

Jul 25 2013


During a conference some while ago, Jacob Appelbaum gave a talk on the usefulness of the Tor project, allowing you to browse anonymously, liberating speech online, enabling web access in censored countries, etc.

Jacob described how the anonymizing Tor network consists of many machines worldwide that run the Tor software and use encryption; they route Internet traffic and anonymize it along the way, until the traffic leaves the network at some random host, so the original sender cannot be traced back. These hosts are called “exit nodes”.

At the end of his talk, he prompted the audience:
Why don’t you run an exit node yet?
I had been using Tor on and off in the past, and while I couldn’t agree more with the privacy goals and anti-censorship measures outlined, I had never set up an exit node to help the network. And I do admin quite a number of hosted machines that have idle bandwidth available…

It took me a while to get round to it, but some months after that I started to set up my first exit node on a hosted virtual server. It took a while to get it all going; I made sure I read up on the legal implications of running it in Germany, set up disclaimers on the host for people checking its port 80, etc. After half a day or so, I had it going, watched in the logs how it connected to the network and… let it run.

Traffic came in slowly at first, but after one or two days the node’s presence had propagated through the net, and it started to max out the CPU and bandwidth limits as configured. So far so good; I was happy to help people all over the world browse the net anonymously, and especially to help folks in countries with Internet censorship access all of the net. Great!

Or so I thought, at least. It took only some five days for me to get an official notice to cease network activity on this host immediately. Complaints about copyright infringement were cited as the reason. It turned out that the majority of the “liberating” traffic I was relaying was torrents of copyrighted material. I had checked out the Tor guidelines in advance, which correctly outline that in Germany the TMG (law on telecommunication media) paragraphs §8 and §15 actually protect me, as a traffic router, from liability for the actual traffic contents, so initially I assumed I’d be fine in case of claims.

It turned out the notice had a twist to it. It was actually my virtual server provider who sent that notice on behalf of a complaining party and argued that I was in violation of their general terms and conditions for purchasing hosting services. Checking those, the conditions read:
Use of the server to provide anonymity services is excluded.
Regardless of the TMG, I was in violation of the hosting provider’s terms and conditions, which allowed premature termination of the hosting contract. At that point I had no choice but to stop the Tor services on this hosting instance.

All in all a dissatisfying experience, but at least I could answer Jacob’s question now:
I’m not running an exit node because it’s not uncommon for German providers to exclude the use of anonymity services on the merits.
I actually got back to Jacob by email and suggested that a note be added to the TorExitGuidelines wiki page, so future contributors know to check the terms and conditions of their hosting services. It seems my request has been ignored to this day, for one reason or another.

I’d still like to support the Tor network however, so for all savvy readers out there, I’m asking:

  • Do you have any provider recommendations where running Tor exit nodes is not an issue? (In Germany perhaps?)
  • Is it at all feasible to be running Tor exit nodes in Germany without having to set a legal budget aside to defend yourself against claims?

Jul 16 2013

In the last few days I finished reading the “Black Swan” by Nassim Nicholas Taleb. Around last January I saw Günther Palfinger mentioning it in my G+ stream, looked it up and bought it.

At first, the book seemed to present some interesting ideas on error statistics, and the first 20 or 30 pages give good examples of conscious knowledge we possess but don’t apply in everyday actions. Not having a trading history like the author, I found reading further, until around page 100, to be a bit of a drag. Luckily I kept on, because after that Taleb finally started to get interesting for me.

Once upon a time…
One of the lectures I attended at university touched on black-box analysis (in the context of modelling and implementation of computer programs). At first, of course, the usual, expected or known input/output behavior is noted, e.g. calculations it may perform, pattern recognition, or any other domain-specific function. But in order to find hints about how it’s implemented, short of inspecting the guts, which a black box won’t allow, one needs to look at error behavior, i.e. examine the outputs in response to invalid/undefined/distorted/erroneous/unusual inputs, and the assorted response times. For a simple example, read a sheet of text and start rotating it while you continue reading. For untrained people, reading speed slows down as the rotation angle increases, indicating that the brain engages in counter-rotation transformations which are linear in complexity with increasing angles.

At that point I started to develop an interest in error analysis and research around that field, leading e.g. to discoveries like the research on “error-friendliness” in technological or biological systems, or studies on human behavior that imply corollaries like:

  • To enable speedy and efficient decision making, humans generally rely on heuristics.
  • Displaying heuristic behavior, people are bound to make errors by design. So trying to eliminate or punish all human error is futile; aiming for robustness and learning from errors instead is much better.
  • Perfectionism is anti-evolutionary; it is a dead end not worth striving for, since something “perfect” lacks flexibility, creativity and robustness, and cannot be improved upon.

A Black Swan?
Now “Black Swan” defines the notion of a high-impact, low-probability event, e.g. occurring in financial trading, people’s wealth or popularity – events from an extreme realm. That’s in contrast to normally distributed encounters like outcomes of a dice game, people’s body size or the number of someone’s relatives – encounters from a mediocre realm.

From Mediocre…
Here’s a short explanation of the mediocre realms. Rolling a regular die will never give a number higher than 6, no matter how often it’s thrown. In fact, the more it’s thrown, the more evenly its numbers are distributed and the clearer its average emerges. Measuring people’s weight or number of relatives shows a similar pattern to throwing a die: the more measurements are taken, the more certain the average becomes. Any new encounter is going to have a lesser and lesser impact on the average of the total as the number of measurements increases.

To Extreme…
On the other hand there are the extreme realms. In trading or wealth or popularity, a single encounter can outweigh the rest of the distribution by several orders of magnitude. Most people have an annual income of less than $100k, but the tiny fraction of society that earns more in annual income possesses more than 50% of the entire distribution of wealth. A similar pattern exists with popularity, only very few people are so popular that they’re known by hundreds of thousands or maybe millions of people. But only very very few people are super popular so they’re known by billions. Averaging over a given set only works for so long, until a high-impact “outlier” is encountered that dominates the entire distribution. Averaging the popularity of hundreds of thousands of farmers, industrial workers or local mayors cannot account for the impact on the total popularity distribution by the encounter of a single Mahatma Gandhi.
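The difference between the two realms is easy to reproduce with a small simulation (my own illustration, not Taleb’s; the function names are mine): the average of dice throws converges quickly, while a sample of heavy-tailed Pareto draws, standing in for wealth or popularity, can be dominated by a single extreme draw.

```python
# Sketch: sample averages in a mediocre realm (dice) vs. an extreme
# realm (heavy-tailed Pareto draws, standing in for wealth/popularity).
import random

def dice_mean(n, seed=1):
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n)) / n

def pareto_mean(n, alpha=1.1, seed=1):
    # alpha close to 1 lets rare huge draws outweigh everything else
    rng = random.Random(seed)
    return sum(rng.paretovariate(alpha) for _ in range(n)) / n

def max_share(n, alpha=1.1, seed=1):
    # fraction of the total contributed by the single largest draw
    rng = random.Random(seed)
    draws = [rng.paretovariate(alpha) for _ in range(n)]
    return max(draws) / sum(draws)
```

A few thousand throws pin dice_mean() close to 3.5, while max_share() over thousands of Pareto draws often remains a large fraction: one “Gandhi” can dominate the whole sample.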

On Errors
Taleb spends a lot of time in the book condemning the application of the Gauss distribution in fields that are prone to extreme encounters, especially economics. Rightfully so, but I would have enjoyed learning more about examples of fields that belong to the extreme realms and are not widely recognized as such. The crux of the inapplicability of the Gauss distribution in the extreme realms lies in two things:

  1. Small probabilities are not accurately computable from sample data, at least not accurately enough to allow for precise decision making. The reason is simple, since the probabilities of rare events are very small, there simply cannot be enough data present to match any distribution model with high confidence.
  2. Rare events that have huge impact, enough impact to outweigh the cumulative effect of all other distribution data, are fundamentally non-Gaussian. Fractal distributions may be useful to retrofit a model to such data, but don’t allow for accurate predictability. We simply need to integrate the randomness and uncertainty of these events into our decision making process.
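Point 1 can be made concrete with a few lines (my own illustration, not Taleb’s): estimating an event with a true probability of 1 in 10,000 from only 1,000 observations usually yields an estimate of exactly 0, and an occasional single hit yields 0.001, a tenfold overestimate. Neither is usable for decision making.

```python
# Sketch: why small probabilities cannot be estimated from limited data.
import random

def empirical_probability(p_true, n_samples, seed):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.random() < p_true)
    return hits / n_samples

# A common event is estimated fine from moderate sample sizes, while
# the rare event's estimate is either 0 or off by an order of magnitude.
common = empirical_probability(0.3, 100000, seed=7)
rare = empirical_probability(0.0001, 1000, seed=7)
```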

Aggravation in the Modern Age
Now Taleb very forcefully articulates what he thinks about economists applying mathematical tools from the mediocre realms (Gauss distribution, averaging, disguising uncertain forecasts as “risk measurements”, etc.) to extreme-realm encounters like trade results, and if you look for that, you’ll find plenty of well-pointed criticism in that book. But what struck me as very interesting, and analytically novel, is that our trends towards globalisation and high interconnectedness, which yield ever bigger entities (bigger corporations, bigger banks, quicker and greater popularity, etc.), are building up the potential for rare events to have higher and higher impacts. E.g. an eccentric pop song can make you much more popular these days on the Internet than TV could 20 years ago. A small number of highly interconnected banks have become so big these days that they “cannot be allowed to fail”.

We are all Human
Considering how humans essentially function as heuristic, not precise, systems (and for good reasons), every human inevitably will commit mistakes and errors at some point, to some lesser or larger degree. Now, admitting we all err once in a while, exercising a small miscalculation during grocery shopping, buying a family house, budgeting a 100-people company, leading a multi-million-people country or operating a multi-trillion currency reserve bank has of course vastly different consequences.

What really got me
So the increasing centralisation and the growth of giant entities ensure that today’s and future miscalculations are disproportionately amplified. In addition, use of the wrong mathematical tools ensures miscalculations won’t be small and won’t be rare; their frequency is likely to increase.

Notably, global connectedness alters the conditions for Black Swan creation, towards both increasing frequency and increasing impact, whether positive or negative. It’s as if our modern society were trying to balance a growing upside-down pyramid of large, ever increasing items on top of its head. At some point it must collapse, and that’s going to hurt, a lot!

Take Away
The third edition of the book closes with essays and commentary that Taleb wrote after the first edition, in response to critics and curious questions. I’m always looking to relate things to practical applications, so I’m glad I got the third edition and can provide my personal highlights to take away from Taleb’s insights:

  1. Avoid predicting rare events
    The frequency of rare events cannot be estimated from empirical observation because of their very rareness (i.e. calculation error margin becomes too big). Thus the probability of high impact rare events cannot be computed with certainty, but because of the high impact it’s not affordable to ignore them.
  2. Limit Gauss distribution modeling
    Application of the Gauss distribution needs to be limited to modelling mediocre realms (where significant events have a high enough frequency and rare events have insignificant impact); it’s unfortunately too broadly abused, especially in economics.
  3. Focus on impact but not probability
    It’s not useful to focus on the probability of rare events since that’s uncertain. It’s useful to focus on the potential impact instead. That can mean to identify hidden risks or to invest small efforts to enable potentially big gains. I.e. always consider the return-on-investment ratio of activities.
  4. Rare events are not alike (atypical)
    Since the probability and accurate impact of remote events are not computable, reliance on rare impacts of a specific size or around specific times is doomed to fail you. Consequently, beware of others making related predictions and/or relying on them.
  5. Strive for variety in your endeavors
    Avoiding overspecialization, learning to love redundancy as well as broadening one’s stakes reduces the effect any single “bad” Black Swan event can have (increases robustness) and variety might enable some positive Black Swan events as well.

What’s next?
The Black Swan idea sets the stage for further investigations, especially investigation of new fields for applicability of the idea. Fortunately, Nassim Taleb continues his research work and has meanwhile published a new book “Antifragile – Things that Gain from Disorder”. It’s already lying next to me while I’m typing and I’m happily looking forward to reading it. 😉

The notion of incomputable rare but consequential events or “errors” is so ubiquitous that many other fields should benefit from applying “Black Swan” or Antifragile classifications and corresponding insights. Nassim’s idea to increase decentralization on the state level to combat escalation of error potentials at centralized institutions has very concrete applications at the software project management level as well. In fact, the Open Source Software community has long benefited from decentralized development models and, through natural organization, avoided the giant pitfalls that occur with top-down waterfall development processes.

Algorithms may be another field where the classifications could be very useful. Most computer algorithm implementations are fragile due to high optimization for efficiency. Identifying these can help in making implementations more robust, e.g. by adding checks for inputs and defining sensible fallback behavior in error scenarios. Identifying and developing new algorithms with antifragility in mind should be most interesting however, good examples are all sorts of caches (they adapt according to request rates and serve cached bits faster), or training of pattern recognition components where the usefulness rises and falls with the variety and size of the input data sets.

The book “Black Swan” is definitely a highly recommended read. However, make sure you get the third edition, which has lots of very valuable treatment added at the end, and don’t hesitate to skip a chapter or two if you find the text too involved or side-tracking every once in a while. Taleb himself advises in several places in the third edition about sections readers might want to skip.

Have you read the “Black Swan” also or heard of it? I’d love to hear if you’ve learned from this or think it’s all nonsense. And make sure to let me know if you’ve encountered Black Swans in contexts that Nassim Taleb has not covered!

Feb 08 2013

Determine candidates and delete from a set of directories containing aging backups.
As a follow-up to the release of last December, here’s a complementary tool we’re using at Lanedo. Suppose a number of backup directories have piled up after a while, using or any other tool that creates time-stamped file names:

 drwxrwxr-x etc-2010-02-02-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-12:45:53-snap
 drwxrwxr-x etc-2012-12-28-06:06:01-snap
 drwxrwxr-x etc-2013-02-02-06:06:01-snap
 lrwxrwxrwx etc-current -> etc-2012-12-28-06:06:01-snap

Which directory should be deleted once the backup device starts to fill up?
Sayepurge parses the timestamps from the names of this set of backup directories, computes the time deltas, and determines good deletion candidates so that backups are spaced out most evenly over time. The exact behavior can be tuned by specifying the number of recent files to guard against deletion (-g), the number of historic backups to keep around (-k) and the maximum number of deletions for any given run (-d). In the above set of files, the two backups from 2011-07-07 are only 6h apart, so they make good purging candidates; example:

 $ -o etc -g 1 -k 3 
 Ignore: ./etc-2013-02-02-06:06:01-snap
 Purge:  ./etc-2011-07-07-06:06:01-snap
 Keep:   ./etc-2012-12-28-06:06:01-snap
 Keep:   ./etc-2011-07-07-12:45:53-snap
 Keep:   ./etc-2010-02-02-06:06:01-snap
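For the curious, the selection logic described above (always pick the backup with the shortest time distance to its neighbors) can be sketched in a few lines of Python. The function names are mine and the real script surely differs in details; this sketch happens to pick the same candidate as the run shown above:

```python
# Sketch: choose purge candidates so that the remaining backups are
# spaced out most evenly over time.
from datetime import datetime

def parse_stamp(name):
    # "etc-2011-07-07-06:06:01-snap" -> datetime(2011, 7, 7, 6, 6, 1)
    stamp = name.split('-', 1)[1].rsplit('-', 1)[0]
    return datetime.strptime(stamp, '%Y-%m-%d-%H:%M:%S')

def purge_candidates(names, nguarded, nkeeps):
    byage = sorted(names, key=parse_stamp)       # oldest first
    keep = byage[:len(byage) - nguarded]
    guarded = byage[len(byage) - nguarded:]      # the newest backups are never purged
    purge = []
    while len(keep) > nkeeps:
        # locate the closest pair of neighbors and drop its older member
        deltas = [(parse_stamp(keep[i + 1]) - parse_stamp(keep[i]), i)
                  for i in range(len(keep) - 1)]
        _, idx = min(deltas)
        purge.append(keep.pop(idx))
    return purge, keep + guarded
```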

For day-to-day use, it makes sense to use both tools combined, e.g. via crontab. Here’s a sample command to perform daily backups of /etc/ and then keep 6 directories worth of daily backups stored in a toplevel backup directory:

 /bin/ -q -C /backups/ -o etc /etc/ && /bin/ -q -o etc -g 3 -k 3

Let me know in the comments what mechanisms you are using to purge aging backups!


The GitHub release tag is here: backups-0.0.2
Script URL for direct downloads:

Usage: [options] sources...
  --inc         merge incremental backups
  -g <nguarded> recent files to guard (8)
  -k <nkeeps>   non-recent to keep (8)
  -d <maxdelet> maximum number of deletions
  -C <dir>      backup directory
  -o <prefix>   output directory name (default: 'bak')
  -q, --quiet   suppress progress information
  --fake        only simulate deletions or merges
  -L            list all backup files with delta times
  Delete candidates from a set of aging backups to spread backups most evenly
  over time, based on time stamps embedded in directory names.
  Backups older than <nguarded> are purged, so that only <nkeeps> backups
  remain. In other words, the number of backups is reduced to <nguarded>
  + <nkeeps>, where <nguarded> are the most recent backups.
  The purging logic will always pick the backup with the shortest time
  distance to other backups. Thus, the number of <nkeeps> remaining
  backups is most evenly distributed across the total time period within
  which backups have been created.
  Purging of incremental backups happens via merging of newly created
  files into the backup's predecessor. Thus merged incrementals may
  contain newly created files from after the incremental backup's creation
  time, but the function of reverse incremental backups is fully
  preserved. Merged incrementals use a different file name ending (-xinc).
See Also – deduplicating backups with rsync

Jan 25 2013


Performance of a C++11 Signal System

First, a quick intro for the uninitiated: signals, in this context, are structures that maintain a list of callback functions with arbitrary arguments, plus assorted reentrant machinery to modify the callback lists and call the callbacks. These allow customization of object behavior in response to signal emissions by the object (i.e. notifying the callbacks by means of invocations).

Over the years, I have rewritten each of GtkSignal, GSignal and Rapicorn::Signal at least once, but most of that was a long time ago, some of it more than a decade. With the advent of lambdas, template argument lists and std::function in C++11, it became time for me to dive into rewriting a signal system once again.

So for the task at hand, which is mainly to update the Rapicorn signal system to something that fits in nicely with C++11, I’ve settled on the most common signal system requirements:

  • Signals need to support arbitrary argument lists.
  • Signals need to provide single-threaded reentrancy, i.e. it must be possible to connect and disconnect signal handlers and re-emit a signal while it is being emitted in the same thread. This one is absolutely crucial for any kind of callback list invocation that’s meant to be remotely reliable.
  • Signals should support non-void return values (of little importance in Rapicorn but widely used elsewhere).
  • Signals can have return values, so they should support collectors (i.e. GSignal accumulators or boost::signal combiners) that control which handlers are called and what is returned from the emission.
  • Signals should have only moderate memory impact on class instances, because at runtime many instances that support signal emissions will actually have 0 handlers connected.
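To illustrate the reentrancy requirement (the crucial one above), here is a deliberately naive sketch in Python rather than C++, since the point is language-independent; none of this is the actual Rapicorn code, and all names are made up. Emission iterates over a snapshot of the handler list and skips handlers disconnected mid-emission:

```python
# Sketch: a minimal signal with arbitrary arguments, single-threaded
# reentrancy and an optional collector.
class Signal:
    def __init__(self, collector=None):
        self._handlers = []          # cells: [callback]; callback is None once disconnected
        self._collector = collector  # returns False to stop calling further handlers

    def connect(self, callback):
        cell = [callback]
        self._handlers.append(cell)
        return cell                  # connection id, used for disconnect()

    def disconnect(self, cell):
        cell[0] = None               # mark dead; safe even during an emission

    def emit(self, *args, **kwargs):
        result = None
        for cell in list(self._handlers):  # snapshot, so connect() during emit is safe
            if cell[0] is None:
                continue                   # disconnected while this emission was running
            result = cell[0](*args, **kwargs)
            if self._collector is not None and not self._collector(result):
                break                      # collector vetoed further handlers
        self._handlers = [c for c in self._handlers if c[0] is not None]
        return result
```

A real implementation adds the performance-critical machinery this post is about; the sketch only shows why an emission must never iterate the live handler list directly.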

For me, the result is pretty impressive. With C++11, a simple signal system that fulfills all of the above requirements can be implemented in less than 300 lines and in a few hours, without the need to resort to any preprocessor magic, scripted code generation or libffi.

I say “simple”, because over the years I’ve come to realize that many of the bells and whistles as implemented in GSignal or boost::signal2 don’t matter much in my practical day to day programming, such as the abilities to block specific signal handlers, automated tracking of signal handler argument lifetimes, emissions details, restarts, cancellations, cross-thread emissions, etc.

Beyond the simplicity that C++11 allows, it’s of course the performance that is most interesting. The old Rapicorn signal system (C++03) comes with its own set of callback wrappers named “slots”, which support between 0 and 16 arguments, essentially mimicking std::function. The new C++11 std::function implementation, in contrast, is opaque to me and supports an unlimited number of arguments, so I was especially curious to see the performance of a signal system based on it.

I wrote a simple benchmark that just measures the times for a large number of signal emissions with negligible time spent in the actual handler; i.e. the signal handler just does a simple uint64_t addition and returns. While the scope of this benchmark is clearly very limited, it serves quite well to give an impression of the overhead associated with emissions in a signal system, which is the most common performance-relevant aspect in practical use.

Without further ado, here are the results of the time spent per emission (less is better) and memory overhead for an unconnected signal (less is better):

Signal System            Emit() in nanoseconds   Static Overhead   Dynamic Overhead
GLib GSignal                      341.156931ns                 0                  0
Rapicorn::Signal, old             178.595930ns                64                  0
boost::signal2                     92.143549ns                24   400 (=265+7+8*16)
boost::signal                      62.679386ns                40      392 (=296+6*16)
Simple::Signal, C++11               8.599794ns                 8                  0
Plain Callback                      1.878826ns                 –                  –


Here, “Plain Callback” indicates the time spent on the actual workload, i.e. without any signal system overhead; everything was measured on an Intel Core i7 at 2.8GHz. Considering the workload, the performance of the C++11 signals is probably close to ideal, and I’m more than happy with it. I’m also severely impressed with the speed that std::function allows for; I originally expected it to be at least an order of magnitude slower.

The memory overhead accounts, on a 64-bit platform, for a signal with 0 connections after its constructor has been called. The “static overhead” is what’s usually embedded in a C++ instance; the “dynamic overhead” is what the embedded signal allocates with operator new in its constructor (the size calculations correspond to effective heap usage, including malloc boundary marks).

The reason GLib’s GSignal has 0 static and 0 dynamic overhead is that it keeps track of signals and handlers in a hash table and sorted arrays, which only consume memory per (instance, signal, handler) triplet, i.e. instances without any signal handlers really have 0 overall memory impact.


  • If you need inbuilt thread safety plus other bells and can spare lots of memory per signal, boost::signal2 is the best choice.
  • For tight scenarios without any spare byte per instance, GSignal will treat your memory best.
  • If you just need raw emission speed and can spare the extra whistles, the C++11 single-file implementation excels.

For the interested, the brief C++11 signal system implementation can be found here:
The API docs for the version that went into Rapicorn are available here: aidasignal.hh

PS: In retrospect I need to add: in this day and age, the better trade-off for GLib could be one or two pointers consumed per instance and signal, if those allowed emission optimizations by a factor of 3 to 5. However, given its complexity and the number of wrapping layers involved, this might be hard to accomplish.

Dec 01 2012

Due to popular request, I’m putting up a polished version of the backup script that we’ve been using over the years at Lanedo to back up our systems remotely. This script uses a special feature of rsync(1) v2.6.4 for the creation of backups which share storage space with previous backups by hard-linking files.
The various options needed for rsync and ssh to minimize transfer bandwidth over the Internet, the time-stamping for the backups, and the handling of several rsync oddities warranted encapsulating the logic in a dedicated script.


The GitHub release tag is here: backups-0.0.1
Script URL for direct downloads:


This example shows creation of two consecutive backups and displays the sizes.

$ -i ~/.ssh/id_examplecom # create backup as bak-.../mydir
$ -i ~/.ssh/id_examplecom # create second bak-2012...-snap/
$ ls -l # show all the backups that have been created
drwxrwxr-x 3 user group 4096 Dez  1 03:16 bak-2012-12-01-03:16:50-snap
drwxrwxr-x 3 user group 4096 Dez  1 03:17 bak-2012-12-01-03:17:12-snap
lrwxrwxrwx 1 user group   28 Dez  1 03:17 bak-current -> bak-2012-12-01-03:17:12-snap
$ du -sh bak-* # the second backup is smaller due to hard links
4.1M    bak-2012-12-01-03:16:50-snap
128K    bak-2012-12-01-03:17:12-snap
4.0K    bak-current
Usage: [options] sources...
  --inc         make reverse incremental backup
  --dry         run and show rsync with --dry-run option
  --help        print usage summary
  -C <dir>      backup directory (default: '.')
  -E <exclfile> file with rsync exclude list
  -l <account>  ssh user name to use (see ssh(1) -l)
  -i <identity> ssh identity key file to use (see ssh(1) -i)
  -P <sshport>  ssh port to use on the remote system
  -L <linkdest> hardlink dest files from <linkdest>/
  -o <prefix>   output directory name (default: 'bak')
  -q, --quiet   suppress progress information
  -c            perform checksum based file content comparisons
  -x            disable crossing of filesystem boundaries
  --version     script and rsync versions
  This script creates full or reverse incremental backups using the
  rsync(1) command. Backup directory names contain the date and time
  of each backup run to allow sorting and selective pruning.
  At the end of each successful backup run, a symlink '*-current' is
  updated to always point at the latest backup. To reduce remote file
  transfers, the '-L' option can be used (possibly multiple times) to
  specify existing local file trees from which files will be
  hard-linked into the backup.
 Full Backups:
  Upon each invocation, a new backup directory is created that contains
  all files of the source system. Hard links are created to files of
  previous backups where possible, so extra storage space is only required
  for contents that changed between backups.
 Incremental Backups:
  In incremental mode, the most recent backup is always a full backup,
  while the previous full backup is degraded to a reverse incremental
  backup, which only contains differences between the current and the
  last backup.
 RSYNC_BINARY Environment variable used to override the rsync binary path.
See Also

Testbit Tools – Version 11.09 Release

Nov 23 2012

For a while now, I’ve been maintaining my todo lists as backlogs in a Mediawiki repository. I regularly derive sprints from these backlogs for my current task lists. This means identifying important or urgent items that can be addressed next; for really huge backlogs this can be quite tedious.

A SpecialPage extension that I’ve recently implemented now helps me through the process. Using it, I automatically get a filtered list of all “IMPORTANT:”, “URGENT:” or otherwise classified list items. The special page can be used per se or via template inclusion from another wiki page. The extension page at has more details.

The Mediawiki extension page is here:

The GitHub page for downloads is here: