What’s the first thing you check when an untouched Unix server starts going berserk?

First Order: Is it responsive? If you can’t log in, there are bigger problems afoot. This generally comes in two flavors: hardware failure and software failure. Both are potentially catastrophic. To prevent DFA errors, check the general hardware health first; a simple glance-over will usually suffice. Second Order: Are the system’s underlying structures in good … Read more

How do I back up my Trac installations?

To fully recover a Trac environment you need the following things: a backup of the DB; the configuration files; the wiki files (HTML and attachments); the password files if you’re using htpasswd auth; optionally, the plugins (even though these are available for download, I’d back them up for quicker recovery). In the case of the standard setup (with SQLite as the DB backend), this … Read more
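As a rough illustration of the checklist above, the sketch below builds the commands for one backup run: `trac-admin <env> hotcopy <dest>` to snapshot the environment directory, plus a plain copy for files that usually live outside it, such as an htpasswd file. The function names and every path are hypothetical, and it assumes `trac-admin` is on the PATH and that the hotcopy destination does not exist yet.

```python
import subprocess
from pathlib import Path


def trac_backup_commands(env_dir, backup_root, extra_files=()):
    """Build (but do not run) the commands for one Trac backup.

    env_dir, backup_root and extra_files are hypothetical paths
    supplied by the caller; nothing here is specific to one setup.
    """
    env = Path(env_dir)
    root = Path(backup_root)
    # hotcopy snapshots the whole environment directory.
    commands = [["trac-admin", str(env), "hotcopy", str(root / env.name)]]
    # Files kept outside the environment (e.g. an htpasswd file)
    # are copied separately, preserving mode and timestamps.
    for extra in extra_files:
        commands.append(["cp", "-p", str(extra), str(root / Path(extra).name)])
    return commands


def run_backup(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_backup(trac_backup_commands("/srv/trac/demo", "/backup/today", ["/etc/trac/htpasswd"]))` would perform the copy. Building the command list separately makes it easy to print and review before anything runs.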

Battery Backed Write Cache

What exactly does it do? The excerpt from this Compaq document explains it well: Power interruptions, even for brief moments, result in the loss of data which was being written to or read from storage… Power interruptions can have terminal effects on data which is in the process of being written and is temporarily residing … Read more

Setting up a new backup scheme

I would highly recommend the book “Backup & Recovery” (O’Reilly Book) by W. Curtis Preston http://oreilly.com/catalog/9780596102463/ Asking how to do your backup plan is kinda like asking 10 grandmothers how to make the best chicken noodle soup. You’ll get 10 different answers but all of them will agree on the basic ingredients. In my opinion, … Read more

How to recover from a drive failure in a RAID 5 configuration?

The system is running very slowly because it has to reconstruct the missing data, which involves additional CPU and I/O. If you have a missing disk in a RAID-5 configuration you have no redundancy left and therefore no recovery strategy. If another disk goes down you will lose your data. Run, don’t walk, to the nearest vendor from which you … Read more
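To see why a degraded array is slow and why a second failure is fatal, here is a toy sketch of the XOR parity idea behind RAID 5. It is only an illustration under simplifying assumptions: one stripe, three data blocks and one parity block, with made-up block contents. A real array rotates parity across the disks and does this work in the controller or kernel.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus the parity computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]. Every read of that block
# now needs all surviving blocks, hence the extra CPU and I/O.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Here `rebuilt` equals the lost block. If a second block were missing as well, the XOR of what remains could no longer isolate either one, which is why the array has no protection left until the failed disk is replaced and rebuilt.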

Documentation As-A-Manual vs. Documentation As-A-Checklist

When writing mine I’ve always devolved into writing two or three sets: the get-er-done checklist, with a MUCH LONGER appendix about the architecture of the system, including why things are done the way they are, probable sticking points when coming online, and abstract design assumptions, followed by a list of probable problems and their resolutions, followed … Read more

Architecture for highly available MySQL with automatic failover in physically diverse locations

You will face the “CAP” theorem problem. You cannot have consistency, availability and partition-tolerance at the same time. DRBD / MySQL HA relies on synchronous replication at the block device level. This is fine while both nodes are available, or if one suffers a temporary fault, is rebooted, etc., then comes back. The problems start … Read more