How to monitor and log the memory/CPU usage of processes over time? [closed]

If you want just the top offenders, consider running top in batch mode with a relatively long interval (60 seconds or more). You may need more than one instance of top running to capture the top offenders for multiple resources. I have configured systems to run top for a few cycles whenever a resource was being overused.
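A minimal sketch of the batch-mode approach, assuming the procps-ng top found on most Linux systems (its -b, -d, -n and -o flags). The function names and log file names are hypothetical, chosen only for illustration.

```python
import subprocess
from datetime import datetime


def build_top_command(interval_s=60, cycles=5, sort_field="%CPU"):
    """Build an argv list for a batch-mode top run (procps-ng flags assumed)."""
    return [
        "top",
        "-b",                    # batch mode: plain text output, no screen control
        "-d", str(interval_s),   # delay between samples, in seconds
        "-n", str(cycles),       # number of samples before top exits
        "-o", sort_field,        # sort column, e.g. %CPU or %MEM
    ]


def log_top(log_path, **kwargs):
    """Append the output of one batch-mode top run to log_path."""
    cmd = build_top_command(**kwargs)
    with open(log_path, "a") as log:
        # Write a header first and flush it so it lands before top's output.
        log.write(f"=== {datetime.now().isoformat()} {' '.join(cmd)} ===\n")
        log.flush()
        subprocess.run(cmd, stdout=log, check=True)
```

Calling log_top("top-cpu.log", sort_field="%CPU") and log_top("top-mem.log", sort_field="%MEM") from separate processes or cron entries would give one log per resource, since a single top sorts on only one column.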

Consider running sar in batch mode to capture resource utilization. I realize this is server based, but it is useful for determining the times when problems are occurring.

Run munin and enable notifications. This may give you a chance to get in and watch the server going down. You may be able to correct the problem before it goes down.

For memory leaks, a steady increase in swap usage indicates a problem. I once watched a server slowly die over a period of days. The problem service was a program monitoring other processes for memory leaks. The system admin kept insisting the increasing swap usage was not a problem, right up until the server stopped responding.
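One way to watch for that pattern is to sample swap usage periodically and flag a run of samples that never drops. The sketch below assumes a Linux /proc/meminfo with SwapTotal and SwapFree lines; the function names and the growth threshold are hypothetical and would need tuning for a real system.

```python
def swap_used_kib(meminfo_text):
    """Return swap in use (KiB) given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    return fields["SwapTotal"] - fields["SwapFree"]


def steadily_increasing(samples, min_growth_kib=1024):
    """True if no sample is lower than the one before it and the total
    growth reaches min_growth_kib (an assumed threshold)."""
    if len(samples) < 2:
        return False
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_drops and samples[-1] - samples[0] >= min_growth_kib


def read_swap_used_kib(path="/proc/meminfo"):
    """Read the current swap usage from the running system (Linux only)."""
    with open(path) as f:
        return swap_used_kib(f.read())
```

Collecting read_swap_used_kib() once an hour and passing the last day or so of samples to steadily_increasing() would have flagged the slow decline described above well before the server stopped responding.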

You may find that cfengine's anomaly detection can be used to trigger a script to capture the system state when things go wrong. You may want a lot of information besides just the processes using the most resources. For a sudden influx of usage, you may want a list of network connections (by address, not name). Memory usage is also useful.
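A capture script of that kind might look like the sketch below. It assumes the usual Linux ps, ss and free commands are installed; the function name, the command set and the directory layout are hypothetical. How the script is triggered (cfengine or anything else) is left out.

```python
import subprocess
from datetime import datetime
from pathlib import Path

# Assumed Linux commands; adjust for your system.
DEFAULT_COMMANDS = {
    "processes": ["ps", "aux", "--sort=-%mem"],
    "connections": ["ss", "-tun"],   # -n: numeric addresses, no name lookups
    "memory": ["free", "-m"],
}


def capture_state(out_dir, commands=DEFAULT_COMMANDS):
    """Run each command and save its output under a timestamped directory."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    snapshot = Path(out_dir) / stamp
    snapshot.mkdir(parents=True, exist_ok=True)
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            output = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure rather than abort the whole snapshot.
            output = f"failed to run {argv}: {exc}\n"
        (snapshot / f"{name}.txt").write_text(output)
    return snapshot
```

Using numeric addresses matters here: name lookups can stall or fail on a struggling server, which is exactly when the snapshot is needed.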
