Geographically distributed, fault-tolerant and “intelligent” application/host monitoring systems

Not an answer really, but some pointers: definitely take a look at the presentation about Nagios @ Goldman Sachs. They faced the problems you mention: redundancy and scalability (thousands of hosts), plus automated configuration generation. I had a redundant Nagios setup, but at a much smaller scale: 80 servers, ~1k services in total. One dedicated master server, one …