Nov 13 2013
 

Reblogged from the Lanedo GmbH blog:

Tobin Daily Visits Screenshot

During recent weeks, I've started to create a new tool, "Tobin", to generate website statistics for a number of sites I'm administering or helping with. I've used programs like Webalizer, Visitors, Google Analytics and others for a long time, but there are correlations and relationships hidden in web server log files that are hard or close to impossible to visualize with these tools.

The JavaScript injections required by Google Analytics are not always acceptable and fall short of accounting for all traffic and transfer volume. Also, some of the log files I'm dealing with are orders of magnitude larger than available memory, so the processing and aggregation algorithms used need to be memory efficient.
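
For illustration, here is one common way to sort inputs that exceed available memory: sort fixed-size chunks in RAM, spill them to disk, then merge the sorted runs. This is a minimal Python sketch of the general technique, not necessarily how Tobin implements it:

    import heapq, tempfile

    def external_sort(infile, outfile, chunk_lines=500000):
        # Sort a log file larger than RAM: sort chunks in memory,
        # spill each sorted run to a temporary file, then merge the runs.
        runs = []
        with open(infile) as f:
            while True:
                chunk = [line for _, line in zip(range(chunk_lines), f)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(outfile, "w") as out:
            out.writelines(heapq.merge(*runs))  # streaming k-way merge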

Here is what Tobin currently does:

  1. Input records are read and sorted on disk; inputs are filtered for a specific year.
  2. A 30-minute window over unique IP address and user agent pairs is used to determine visits (see the sketch after this list).
  3. Hits for auxiliary resources such as images, CSS files and WordPress resource files are filtered out to derive page counts.
  4. Statistics such as per-hour accounting and geographical origin are collected.
  5. Top-50 charts and graphs are generated in an HTML report from the collected statistics.
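
To illustrate step 2, here is a minimal sketch of such a 30-minute visit window in Python. This is an assumption about the general approach, not Tobin's actual code:

    VISIT_WINDOW = 30 * 60  # seconds

    def count_visits(records):
        # records: (timestamp, ip, user_agent) tuples, sorted by
        # timestamp, as produced by the on-disk sort in step 1.
        last_seen = {}  # (ip, user_agent) -> time of the last request
        visits = 0
        for ts, ip, agent in records:
            key = (ip, agent)
            if key not in last_seen or ts - last_seen[key] > VISIT_WINDOW:
                visits += 1  # the gap exceeded the window, a new visit starts
            last_seen[key] = ts
        return visits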

There is lots of room for future improvements, e.g. additional modules for new charts and graphs, possibly accounting across multiple years, use of intermediate files to speed up processing, and more. In any case, the current state works well for giant log files and already provides interesting graphs. The code right now is alpha quality, i.e. ready for a technical preview, but it may still have some quirks. Feedback is welcome.

The tobin source code and tobin issue tracker are hosted on GitHub. Python 2.7 and a number of auxiliary modules are required. Building and testing work as follows:

$ git clone https://github.com/tim-janik/tobin.git
$ make # create tobin executable
$ ./tobin -n MySiteName mysite.log
$ x-www-browser logreport/index.html

Leave me a comment if you have issues testing it out and let me know how the report generation works for you. 😉

Feb 08 2013
 
About

Determine deletion candidates and delete them from a set of directories containing aging backups.
As a follow-up to the release of sayebackup.sh last December, here is a complementary tool we're using at Lanedo. Suppose a number of backup directories have piled up after a while, created with sayebackup.sh or any other tool that produces time-stamped file names:

 drwxrwxr-x etc-2010-02-02-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-06:06:01-snap
 drwxrwxr-x etc-2011-07-07-12:45:53-snap
 drwxrwxr-x etc-2012-12-28-06:06:01-snap
 drwxrwxr-x etc-2013-02-02-06:06:01-snap
 lrwxrwxrwx etc-current -> etc-2012-12-28-06:06:01-snap

Which file should be deleted once the backup device starts to fill up?
Sayepurge parses the timestamps from the names of this set of backup directories, computes the time deltas, and determines good deletion candidates so that backups are spaced out as evenly as possible over time. The exact behavior can be tuned by specifying the number of recent files to guard against deletion (-g), the number of historic backups to keep around (-k) and the maximum number of deletions for any given run (-d). In the above set of files, the two backups from 2011-07-07 are only about six hours apart, so they make good purging candidates. Example (a sketch of the selection heuristic follows below):

 $ sayepurge.sh -o etc -g 1 -k 3 
 Ignore: ./etc-2013-02-02-06:06:01-snap
 Purge:  ./etc-2011-07-07-06:06:01-snap
 Keep:   ./etc-2012-12-28-06:06:01-snap
 Keep:   ./etc-2011-07-07-12:45:53-snap
 Keep:   ./etc-2010-02-02-06:06:01-snap
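
For illustration, here is a minimal Python sketch of such an even-spacing heuristic, repeatedly dropping the unguarded backup with the shortest time distance to a neighbour. It is simplified from the description above, not the actual sayepurge.sh logic:

    def purge_candidates(timestamps, nguarded, nkeeps):
        # timestamps: backup times in seconds; returns the times to purge.
        times = sorted(timestamps)
        purge = []
        while len(times) > nguarded + nkeeps:
            ncandidates = len(times) - nguarded  # never touch the most recent ones
            def gap(i):  # time distance to the nearest neighbouring backup
                left = times[i] - times[i - 1] if i > 0 else float("inf")
                right = times[i + 1] - times[i] if i + 1 < len(times) else float("inf")
                return min(left, right)
            victim = min(range(ncandidates), key=gap)
            purge.append(times.pop(victim))
        return purge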

For day-to-day use, it makes sense to combine both tools, e.g. via crontab. Here's a sample command that performs daily backups of /etc/ and keeps 6 directories worth of daily backups stored in a top-level backup directory:

 /bin/sayebackup.sh -q -C /backups/ -o etc /etc/ && /bin/sayepurge.sh -q -o etc -g 3 -k 3

Let me know in the comments what mechanisms you are using to purge aging backups!

Resources

The GitHub release tag is here: backups-0.0.2
Script URL for direct downloads: sayepurge.sh

Usage
Usage: sayepurge.sh [options] sources...
OPTIONS:
  --inc         merge incremental backups
  -g <nguarded> recent files to guard (8)
  -k <nkeeps>   non-recent to keep (8)
  -d <maxdelet> maximum number of deletions
  -C <dir>      backup directory
  -o <prefix>   output directory name (default: 'bak')
  -q, --quiet   suppress progress information
  --fake        only simulate deletions or merges
  -L            list all backup files with delta times
DESCRIPTION:
  Delete candidates from a set of aging backups to spread backups most evenly
  over time, based on time stamps embedded in directory names.
  Backups older than the <nguarded> most recent ones are subject to purging,
  so that only <nkeeps> of them remain. In other words, the number of
  backups is reduced to <nguarded> + <nkeeps>, where <nguarded> are the
  most recent backups.
  The purging logic will always pick the backup with the shortest time
  distance to other backups. Thus, the <nkeeps> remaining backups end up
  most evenly distributed across the total time period within which
  backups have been created.
  Purging of incremental backups happens by merging newly created
  files into the backup's predecessor. Merged incrementals may thus
  contain files created after the incremental backup's creation
  time, but the function of reverse incremental backups is fully
  preserved. Merged incrementals use a different file name ending (-xinc).
See Also

Sayebackup.sh – deduplicating backups with rsync

Dec 01 2012
 
About

By popular request, I'm putting up a polished version of the backup script that we've been using over the years at Lanedo to back up our systems remotely. This script uses a special feature of rsync(1) v2.6.4 to create backups that share storage space with previous backups by hard-linking files.
The various options needed for rsync and ssh to minimize transfer bandwidth over the Internet, time-stamping for the backups, and the handling of several rsync oddities warranted encapsulating the logic in a dedicated script.
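
The feature in question is presumably rsync's --link-dest option, which hard-links unchanged files against a previous snapshot. Here is a minimal Python sketch of that idea, an illustration rather than what sayebackup.sh actually does:

    import os, subprocess, time

    def snapshot(source, backup_dir, prefix="bak"):
        # Create a time-stamped snapshot; unchanged files are hard-linked
        # against the previous snapshot via rsync --link-dest.
        stamp = time.strftime("%Y-%m-%d-%H:%M:%S")
        dest = os.path.join(backup_dir, "%s-%s-snap" % (prefix, stamp))
        current = os.path.join(backup_dir, "%s-current" % prefix)
        cmd = ["rsync", "-a", "--delete"]
        if os.path.islink(current):
            cmd.append("--link-dest=" + os.path.realpath(current))
        subprocess.check_call(cmd + [source, dest])
        tmp = current + ".tmp"  # update the '*-current' symlink atomically
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(os.path.basename(dest), tmp)
        os.rename(tmp, current)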

Resources

The GitHub release tag is here: backups-0.0.1
Script URL for direct downloads: sayebackup.sh

Example

This example shows creation of two consecutive backups and displays the sizes.

$ sayebackup.sh -i ~/.ssh/id_examplecom user@example.com:mydir # create backup as bak-.../mydir
$ sayebackup.sh -i ~/.ssh/id_examplecom user@example.com:mydir # create second bak-2012...-snap/
$ ls -l # show all the backups that have been created
drwxrwxr-x 3 user group 4096 Dec  1 03:16 bak-2012-12-01-03:16:50-snap
drwxrwxr-x 3 user group 4096 Dec  1 03:17 bak-2012-12-01-03:17:12-snap
lrwxrwxrwx 1 user group   28 Dec  1 03:17 bak-current -> bak-2012-12-01-03:17:12-snap
$ du -sh bak-* # the second backup is smaller due to hard links
4.1M    bak-2012-12-01-03:16:50-snap
128K    bak-2012-12-01-03:17:12-snap
4.0K    bak-current
Usage
Usage: sayebackup.sh [options] sources...
OPTIONS:
  --inc         make reverse incremental backup
  --dry         run and show rsync with --dry-run option
  --help        print usage summary
  -C <dir>      backup directory (default: '.')
  -E <exclfile> file with rsync exclude list
  -l <account>  ssh user name to use (see ssh(1) -l)
  -i <identity> ssh identity key file to use (see ssh(1) -i)
  -P <sshport>  ssh port to use on the remote system
  -L <linkdest> hardlink dest files from <linkdest>/
  -o <prefix>   output directory name (default: 'bak')
  -q, --quiet   suppress progress information
  -c            perform checksum based file content comparisons
  --one-file-system
  -x            disable crossing of filesystem boundaries
  --version     script and rsync versions
DESCRIPTION:
  This script creates full or reverse incremental backups using the
  rsync(1) command. Backup directory names contain the date and time
  of each backup run to allow sorting and selective pruning.
  At the end of each successful backup run, a symlink '*-current' is
  updated to always point at the latest backup. To reduce remote file
  transfers, the '-L' option can be used (possibly multiple times) to
  specify existing local file trees from which files will be
  hard-linked into the backup.
 Full Backups:
  Upon each invocation, a new backup directory is created that contains
  all files of the source system. Hard links are created to files of
  previous backups where possible, so extra storage space is only required
  for contents that changed between backups.
 Incremental Backups:
  In incremental mode, the most recent backup is always a full backup,
  while the previous full backup is degraded to a reverse incremental
  backup, which only contains differences between the current and the
  last backup.
 RSYNC_BINARY Environment variable used to override the rsync binary path.
See Also

Testbit Tools – Version 11.09 Release

Oct 01 2011
 
No Bugs (Image: Mag3737)

And here’s another muffin from the code cake factory…

About Testbit Tools
The ‘Testbit Tools’ package contains tools proven to be useful during the development of several Testbit and Lanedo projects. The tools are Free Software and can be redistributed under the GNU GPLv3+.

This release features the addition of buglist.py, useful for generating reports and summaries from your favorite Bugzilla.

Downloading Testbit Tools
The Testbit Tools packages are available for download in the testbit-tools folder; the newest release is here: testbit-tools-11.09.0.tar.bz2

Changes in version 11.09.0:

  • Added buglist, a script to list and download bugs from bug trackers.
  • Added buildfay, a script with various sub commands to aid release making.
  • Fixed version information for all tools.
  • Added support for the Xamarin Bug Tracker to buglist.py.

Feedback

If you find this release useful, we highly appreciate your feature requests, bug reports, patches or review comments!

See Also

  1. The Bugzilla Utility buglist.py – managing bug lists
  2. Wikihtml2man Introduction – using html2wiki

May 13 2011

Wiki↠HTML↠Man

What's this?
Wikihtml2man is an easy-to-use converter that parses HTML sources, normally originating from a MediaWiki page, and generates Unix Manual Page sources from them (it is also referred to as an html2man or wiki2man converter). It allows developing project documentation online, e.g. by collaborating in a wiki. It is released as free software under the GNU GPLv3. Technical details are given in its manual page: Wikihtml2man.1.

Why move documentation online?
Google turns up a few alternative implementations, but none seem to be designed as general purpose tools. With the ubiquitous presence of wikis on the web these days and the ease of content authoring they provide, we've decided to move manual page authoring online for the Beast project. Using MediaWiki, manual pages turn out to be very easy to create in a wiki; all that's then needed is a backend tool that can generate Manual Page sources from a wiki page. Wikihtml2man provides this functionality based on the HTML generated from wiki pages; it can convert a prerendered HTML file or download the wiki page from a specific URL. HTML has been chosen as the input format to support arbitrary wiki features like page inclusion or macro expansion, and to potentially allow page generation from wikis other than MediaWiki. Since wikihtml2man is based purely on HTML input, it is of course also possible to write the Manual Page in raw HTML, using tags such as h1, strong, dt, dd, li, etc., but that's much less convenient than a regular wiki engine.
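
To give an idea of the conversion, here is a tiny Python sketch of the HTML-to-man mapping, using the Python 2 HTMLParser module to match the era. It only handles a few tags and is far simpler than wikihtml2man itself:

    from HTMLParser import HTMLParser  # Python 2 module name

    class Html2Man(HTMLParser):
        # Maps a handful of HTML tags to roff/man macros.
        def __init__(self):
            HTMLParser.__init__(self)
            self.out = []
        def handle_starttag(self, tag, attrs):
            if tag == "h1":
                self.out.append("\n.SH ")
            elif tag == "strong":
                self.out.append("\\fB")  # bold on
            elif tag == "li":
                self.out.append("\n.IP \\(bu\n")
        def handle_endtag(self, tag):
            if tag == "strong":
                self.out.append("\\fP")  # bold off
            elif tag in ("h1", "p"):
                self.out.append("\n")
        def handle_data(self, data):
            self.out.append(data)

    parser = Html2Man()
    parser.feed("<h1>NAME</h1><p><strong>mytool</strong> - do things</p>")
    print "".join(parser.out)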

What are the benefits?
For Beast, the benefits of moving some project documentation into an online wiki are:

  • We increase editability by lifting review requirements.
  • We get quicker edit/view turnarounds, e.g. through the page preview functionality in wikis.
  • We can assimilate documentation contributions from non-programmers.
  • Easier editability may lead to richer and more frequently updated documentation.
  • Other projects also seem to make good progress by opening up some development tasks to online web interfaces, e.g. Pootle translations, Transifex translations or PHP.net User Notes.

What are the downsides?
We have only recently moved our pages online and still need to gather experience with the process. The possible downsides we see so far are:

  • Sources and documentation can more easily get out of sync if they don't reside in the same tree. We hope to mitigate this by increasing documentation update frequencies.
  • Confusion about revision synchronization, since the source code uses a different versioning system than the online wiki. We are currently pondering automated re-integration into the tree to counteract this problem.

How to use it?
Here's wikihtml2man in action, converting its own manual page and rendering it through man(1):

  wikihtml2man.py http://testbit.eu/Wikihtml2man.1?action=render | man -l -

Where to get it?
Release tarballs shipping wikihtml2man are kept here: http://dist.testbit.eu/testbit-tools/.
Our Tools page contains more details about the release tarballs.

Have feedback or questions?
If you can put wikihtml2man to good use, have problems running it or other ideas about it, feel free to drop me a line. Alternatively, you can add your feedback and feature requests to the Feature Requests page (a forum will be created if there's any actual demand).

What's related?
We would also like to hear from people involved in other projects that are using or considering wikis to build production documentation online (e.g. in a manner similar to Wikipedia). So please leave a comment and tell us about it if you do something similar.

See Also

  1. New Beast Website – using html2wiki
  2. The Beast Documentation Quest – looking into documentation choices

Oct 20 2008

Managing bug lists has become a ubiquitous task when dealing with the GNOME or Nokia bugzillas. At some point I became fed up with the cut and pasting, searching and sorting involved, so I cooked up a small command line utility to construct bug list URLs and format bug list summaries in text form. Here it is in action:

    echo "junktext 556578 moretext 516885 " | buglist.py gnome
    http://bugzilla.gnome.org/buglist.cgi?bug_id=516885,556578
     516885 - Add RGBA support
     556578 - GIMP windows stay on top of other windows

It knows a good number of bugzillas, such as the ones from GNOME, FreeDesktop, Maemo, Nokia, OpenedHand, GCC, LibC and Mozilla. More Bugzilla URLs can easily be added, and it handles the HTTPS authentication that some corporate Bugzilla installations require.
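
The core of such a tool is small; here is a minimal Python sketch of the URL construction, an assumption about the approach rather than buglist.py's actual code:

    import re, sys

    # Known trackers; the real script supports many more (hypothetical map).
    TRACKERS = {
        "gnome": "http://bugzilla.gnome.org/buglist.cgi?bug_id=%s",
    }

    def buglist_url(tracker, text):
        # Pick 4-7 digit numbers out of arbitrary text and deduplicate them.
        ids = sorted(set(re.findall(r"\b\d{4,7}\b", text)), key=int)
        return TRACKERS[tracker] % ",".join(ids)

    if __name__ == "__main__":
        sys.stdout.write(buglist_url(sys.argv[1], sys.stdin.read()) + "\n")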

The script is available here: buglist.py (v0.4)

Have fun and send in patches. 😉

Oct 30 2007

A couple of people have reported minor and major bugs in the last yyhelp version, particularly after yycommit was reimplemented to operate on top of git-commit(1) instead of cg-commit(1). Among other fixes, this new release resolves all known yycommit issues and also (re-)introduces some features: yyhelp (v0.9)

    Overview of Changes in YummyYummySourceControl-0.9:

    * use plain "git-commit" with temporary index file to stage
      commit files, this works around git-commit-1.5.2.5 not
      handling deleted files as command line args correctly.
    * also list remote branches for yylsbranches.
    * fix leading dot getting stripped from modified files if $gitprefix=.
    * require and use gawk for time formatting, which mawk doesn't support.
    * properly honor the [FILES...] arguments to yycommit.
    * terminate sed command blocks with semicolon (needed on BSD).
    * resurrected yyhelp.auto-push-commits functionality of yycommit.