If you want just the top offenders, consider running top with a relatively long interval (60 seconds or more) in batch mode. You may need more than one instance of top running to capture the top offenders for multiple resources. I have configured systems to run top for a few cycles whenever a resource was being overused.
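As a rough sketch of the batch-mode idea, the Python below builds and runs a top command line. It assumes the procps version of top found on most Linux systems, where `-b` is batch mode, `-d` is the delay in seconds, `-n` is the number of cycles, and `-o` picks the sort field. The function names and defaults are mine, not part of any tool.

```python
import subprocess


def top_command(interval=60, cycles=5, sort_field=None):
    """Build an argv list for procps top in batch mode (hypothetical helper)."""
    cmd = ["top", "-b", "-d", str(interval), "-n", str(cycles)]
    if sort_field:
        # Assumption: procps-ng top, where -o names the sort field,
        # for example "%MEM" or "%CPU".
        cmd += ["-o", sort_field]
    return cmd


def run_top(log_path, **kwargs):
    """Append top's batch output to log_path; blocks until top exits."""
    with open(log_path, "a") as log:
        result = subprocess.run(
            top_command(**kwargs),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

Calling `run_top("/var/tmp/top-mem.log", sort_field="%MEM")` and a second call sorted by `%CPU` would give one log per resource; the log paths here are only placeholders.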
Consider running sar in batch mode to capture resource utilization. I realize this is server-based, but it is useful for determining the times when problems occur.
Run munin and enable notifications. This may give you a chance to get in and watch the server as it goes down. You may even be able to correct the problem before it does.
For memory leaks, a steady increase in swap usage indicates a problem. I once watched a server slowly die over a period of days. The problem service was a program monitoring other processes for memory leaks. The system admin kept insisting the increasing swap usage was not a problem, right up until the server stopped responding.
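If you want to watch for that pattern yourself, something like the following might do. It is a minimal sketch that assumes a Linux system, where `/proc/meminfo` reports `SwapTotal` and `SwapFree` in kB. The helper names and the "never drops, and grows overall" rule are my own simplification, not a standard leak test.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value.
            fields[name.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


def steadily_increasing(samples, min_growth_kb=0):
    """True if the samples never decrease and grow by more than min_growth_kb."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) > min_growth_kb
```

You would collect one sample every few minutes, for example with `swap_used_kb(open("/proc/meminfo").read())`, and check the list after a day or so. Short-lived swap spikes are normal, so look at the trend over a long window.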
You may find that cfengine's anomaly detection can be used to trigger a script to capture the system state when things go wrong. You may want a lot of information besides just the processes using the most resources. For a sudden influx of usage, you may want a list of network connections (by address, not name). Memory usage is also useful.
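A capture script along those lines could look like the sketch below. The command list is an assumption about what is installed (`ps`, `free`, and `ss` on a typical Linux box), and the function name and directory layout are made up for illustration. How you hook it up to cfengine or any other trigger is not shown.

```python
import os
import subprocess
import time

# Assumed commands; adjust to what your system actually has.
DEFAULT_COMMANDS = {
    "processes": ["ps", "aux"],
    "memory": ["free", "-m"],
    # -n keeps addresses numeric, so no slow name lookups.
    "connections": ["ss", "-tuna"],
}


def capture_state(out_dir, commands=None, timeout=30):
    """Run each command and save its output under a timestamped directory."""
    commands = DEFAULT_COMMANDS if commands is None else commands
    snapshot_dir = os.path.join(out_dir, time.strftime("%Y%m%d-%H%M%S"))
    os.makedirs(snapshot_dir, exist_ok=True)
    written = {}
    for name, argv in commands.items():
        path = os.path.join(snapshot_dir, name + ".txt")
        try:
            result = subprocess.run(argv, capture_output=True,
                                    text=True, timeout=timeout)
            output = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A missing or hung tool should not stop the other captures.
            output = "could not run %s: %s\n" % (" ".join(argv), exc)
        with open(path, "w") as handle:
            handle.write(output)
        written[name] = path
    return written
```

The timeout matters here: on a server that is already struggling, one stuck command should not prevent the rest of the snapshot from being written.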