Feb 08 2013

Determine deletion candidates in a set of directories containing aging backups, and delete them.
As a follow-up to the release of sayebackup.sh last December, here’s a complementary tool we’re using at Lanedo. Suppose a number of backup directories have piled up after a while, using sayebackup.sh or any other tool that creates timestamped file names:

 drwxrwxr-x etc-2010-02-02-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-12:45:53-snap
 drwxrwxr-x etc-2012-12-28-06:06:01-snap
 drwxrwxr-x etc-2013-02-02-06:06:01-snap
 lrwxrwxrwx etc-current -> etc-2012-12-28-06:06:01-snap

Which directory should be deleted once the backup device starts to fill up?
Sayepurge parses the timestamps from the names of this set of backup directories, computes the time deltas, and determines good deletion candidates so that the remaining backups are spaced as evenly as possible over time. The exact behavior can be tuned by specifying the number of recent files to guard against deletion (-g), the number of historic backups to keep around (-k) and the maximum number of deletions for any given run (-d). In the above set of files, the two backups from 2011-07-07 are less than 7 hours apart, so one of them makes a good purging candidate. For example:

 $ sayepurge.sh -o etc -g 1 -k 3 
 Ignore: ./etc-2013-02-02-06:06:01-snap
 Purge:  ./etc-2011-07-07-06:06:01-snap
 Keep:   ./etc-2012-12-28-06:06:01-snap
 Keep:   ./etc-2011-07-07-12:45:53-snap
 Keep:   ./etc-2010-02-02-06:06:01-snap
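
The selection step can be illustrated with a short Python sketch. This is not how sayepurge.sh itself is implemented. The names parse_stamp and plan_purge are hypothetical, and the scoring rule (smallest gap to a neighbor, with ties broken by the combined gap to both neighbors, measured only among the unguarded backups) is an assumption chosen to match the description above.

```python
import re
from datetime import datetime

# Matches stamps such as 2011-07-07-06:06:01 anywhere in a directory name.
STAMP = re.compile(r"\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}")


def parse_stamp(name):
    """Return the datetime embedded in a backup directory name."""
    return datetime.strptime(STAMP.search(name).group(0), "%Y-%m-%d-%H:%M:%S")


def plan_purge(names, nguarded, nkeeps):
    """Split names into (guarded, purged, kept). Hypothetical helper."""
    # Names without a time stamp (e.g. the etc-current symlink) are skipped.
    stamped = sorted((n for n in names if STAMP.search(n)), key=parse_stamp)
    split = max(len(stamped) - nguarded, 0)
    pool, guarded = stamped[:split], stamped[split:]  # pool is oldest first
    purged = []
    while pool and len(pool) > nkeeps:
        times = [parse_stamp(n) for n in pool]

        def score(i):
            # Gaps in seconds to the older and newer neighbor, if present.
            gaps = []
            if i > 0:
                gaps.append((times[i] - times[i - 1]).total_seconds())
            if i + 1 < len(times):
                gaps.append((times[i + 1] - times[i]).total_seconds())
            nearest = min(gaps, default=float("inf"))
            # Assumed tie-break: combined gap; endpoints never win a tie.
            total = sum(gaps) if len(gaps) == 2 else float("inf")
            return (nearest, total)

        victim = min(range(len(pool)), key=score)
        purged.append(pool.pop(victim))
    return guarded, purged, pool


names = [
    "etc-2010-02-02-06:06:01-snap",
    "etc-2011-07-07-06:06:01-snap",
    "etc-2011-07-07-12:45:53-snap",
    "etc-2012-12-28-06:06:01-snap",
    "etc-2013-02-02-06:06:01-snap",
    "etc-current",
]
guarded, purged, kept = plan_purge(names, nguarded=1, nkeeps=3)
print(purged)  # ['etc-2011-07-07-06:06:01-snap']
```

With -g 1 and -k 3 as in the run above, the sketch guards the 2013 backup and picks the same purge candidate. Both 2011-07-07 backups share the same smallest gap, and the earlier one has the shorter combined distance to its neighbors.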

For day-to-day use, it makes sense to combine both tools, e.g. via crontab. Here’s a sample command that performs daily backups of /etc/ and then keeps six directories’ worth of daily backups (three guarded, three historic) stored in a top-level directory for backups:

 /bin/sayebackup.sh -q -C /backups/ -o etc /etc/ && /bin/sayepurge.sh -q -C /backups/ -o etc -g 3 -k 3

Let me know in the comments what mechanisms you are using to purge aging backups!


The GitHub release tag is here: backups-0.0.2
Script URL for direct downloads: sayepurge.sh

Usage: sayepurge.sh [options] sources...
  --inc         merge incremental backups
  -g <nguarded> recent files to guard (8)
  -k <nkeeps>   non-recent to keep (8)
  -d <maxdelet> maximum number of deletions
  -C <dir>      backup directory
  -o <prefix>   output directory name (default: 'bak')
  -q, --quiet   suppress progress information
  --fake        only simulate deletions or merges
  -L            list all backup files with delta times
  Delete candidates from a set of aging backups to spread backups most evenly
  over time, based on time stamps embedded in directory names.
  Backups older than <nguarded> are purged, so that only <nkeeps> backups
  remain. In other words, the number of backups is reduced to <nguarded>
  + <nkeeps>, where <nguarded> are the most recent backups.
  The purging logic will always pick the backup with the shortest time
  distance to other backups. Thus, the number of <nkeeps> remaining
  backups is most evenly distributed across the total time period within
  which backups have been created.
  Purging of incremental backups happens via merging of newly created
  files into the backup's predecessor. Thus merged incrementals may
  contain newly created files from after the incremental backup's creation
  time, but the function of reverse incremental backups is fully
  preserved. Merged incrementals use a different file name ending (-xinc).
See Also

Sayebackup.sh – deduplicating backups with rsync

  4 Responses to “Sayepurge.sh – determine deletion of aging backups”

  1. This looks similar to http://www.spinellis.gr/sw/unix/fileprune/ and I wish I were able to use features from both: parsing dates from the filename and deleting whole directories, as your tool does, together with the nice exponential deletion from fileprune. Maybe you can get some good ideas from fileprune.

  2. After a hard drive crash (…), I set up a backup system. It’s very primitive: every now and then I make plain copies on two external drives. Now I am facing the space shortage issue, so I have been wondering what to delete. I am not completely satisfied with automatic deletions, because I think that I may delete a file which does not exist in the surrounding snapshots:
    snap-1 : file in draft state
    snap-2 : file in its final state
    snap-3 : file is deleted.
    If I delete snap-2, I lose the file that interests me.

    So I have been considering two types of snapshots: the automatic regular ones, and the ones I trigger for a specific reason. To draw a parallel with a versioning system, I see automatic backups as (micro) commits on my working copy, and triggered backups as commits/merges to the main branch. A good reason to trigger a backup is a clear milestone of a project. Automatic snapshots are then mostly useful only between the last triggered snapshot and now. Basically, all other automatic ones could be deleted if space is lacking. Ideally, the triggered snapshot could have a label (like a commit message) as a reminder of the general state of the snapshot and how it differs from the previous triggered one.

    Does it make sense?
    These are just my thoughts as they are now, since I have only recently run into this space shortage problem, which I may simply solve (postpone) by buying a bigger drive…

    • Hey Luc. These days, given HDD sizes and prices, just buying a bigger drive is generally the fastest and (in terms of time) often also cheapest option 😉
      That said, we’re using sayepurge.sh so it keeps around a set of recent copies untouched, which means for your case:
      snap-1 : file in draft state
      snap-2 : file in its final state
      snap-3 : file is hard-linked.
      snap-4 : file is hard-linked.
      snap-5 : file is hard-linked.
      snap-6 : file is accidentally deleted.

      The last 3 to 4 backups are usually kept, which means that if snaps 1, 2, and 3 get purged, the latest version is usually still recoverable. The deletion would have to go unnoticed for a long time for the file to actually vanish from the backups as well.

      When it comes to projects, however, I agree you need full version control with fine-grained commits. No backup scheme can serve as a substitute here. Daily backups are good for making sure the version control system itself isn’t lost though. 😉
