Find "stale" processes on Linux / Unix servers
Sometimes processes on your server run unexpectedly long or don't die on time. This can cause many issues, mainly because server resources start to run out.
Stale-proc-check is a sparrow plugin that shows you whether any "stale" processes exist on your server. It depends on the ps utility, so it will probably work on many Linux/Unix boxes.
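The plugin's own code is not shown in this post, so the following is only a rough sketch of the kind of data ps can provide, not the plugin's actual implementation. It assumes a POSIX ps that supports the etime (elapsed time) column, printed as [[dd-]hh:]mm:ss. The helper name etime_to_seconds is hypothetical.

```shell
# Hypothetical helper: convert a ps "etime" value ([[dd-]hh:]mm:ss) into seconds.
etime_to_seconds() {
  printf '%s\n' "$1" | awk -F'[-:]' '{
    n = NF
    s = $n + 60 * $(n-1)               # seconds and minutes are always present
    if (n >= 3) s += 3600 * $(n-2)     # hours, if present
    if (n >= 4) s += 86400 * $(n-3)    # days, if present
    print s
  }'
}

# To see the raw data on a live system (pid, elapsed time, command line):
#   ps -e -o pid= -o etime= -o args=
```

For example, etime_to_seconds 05:16 gives the age in seconds of a process that has been running for 5 minutes and 16 seconds.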
Below is a short manual.
INSTALL
$ sparrow plg install stale-proc-check
USAGE
Once the plugin is installed, you need to define a configuration for it using a sparrow checkpoint container, which is just an abstraction for a configurable sparrow plugin.
You need to provide 2 parameters:
filter - a Perl regular expression matching the processes you are interested in
history - a time period within which the matched processes are considered fresh enough
In other words, if any matching process older than the history parameter is found, it is treated as a bad situation and the check fails.
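To make the roles of the two parameters above more concrete, here is a minimal sketch of the same idea in plain shell. It is an illustration only and not how the plugin is implemented. The helper name find_stale and its input format ("pid elapsed_seconds command") are assumptions. Note that awk uses POSIX extended regular expressions rather than Perl ones, so a filter such as knife\s+ssh may need rewriting (for example with [[:space:]]+).

```shell
# Hypothetical helper: reads "pid elapsed_seconds command..." lines on stdin and
# prints those whose command matches the filter ($1) and whose age exceeds
# the allowed history in seconds ($2).
find_stale() {
  awk -v filter="$1" -v max_age="$2" '{
    cmd = $0
    sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+/, "", cmd)   # strip pid and age columns
    if (cmd ~ filter && $2 + 0 > max_age + 0) print
  }'
}

# On Linux with procps-ng (assumption: the non-POSIX "etimes" column is available):
#   ps -e -o pid= -o etimes= -o args= | find_stale 'sleep' 300
```

An empty result corresponds to the "good" case, and any printed line corresponds to a process that would make the check fail.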
OK, now we are ready to configure the check:
$ sparrow project create system
$ sparrow check add system stale-ssh-sessions
$ sparrow check set system stale-ssh-sessions stale-proc-check
$ export EDITOR=nano && sparrow check ini system stale-ssh-sessions
[stale-proc-check]
# let's find all knife ssh processes running more than half an hour
filter = knife\s+ssh
history = 30 minutes
In this example I am looking for all knife ssh processes running for more than half an hour. On our production system it is typical that, for some reason, knife ssh commands do not die even though the parent process has terminated. Well, we have bugs too :)
Now let's run a check:
$ sparrow check run system stale-ssh-sessions
# running cd /root/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./ --ini /root/sparrow/projects/system/checkpoints/stale-ssh-sessions/suite.ini ' ...
/tmp/.outthentic/25883/root/sparrow/plugins/public/stale-proc-check/story.t ..
# filter: knife\s+ssh
# history: 30 minutes
# 0 stale processes found
ok 1 - output match /count: (\d+)/
ok 2 - zero stale processes found
1..2
ok
All tests successful.
Files=1, Tests=2, 0 wallclock secs ( 0.02 usr 0.00 sys + 0.09 cusr 0.00 csys = 0.11 CPU)
Result: PASS
Hurrah, no stale processes here ... But before ending this post, let me show a negative case as well. Let's start a few sleep commands and check whether they are still running. Indeed they should be! :)
$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &
$ sparrow check add system sleepyheads
$ sparrow check set system sleepyheads stale-proc-check
$ export EDITOR=nano && sparrow check ini system sleepyheads
[stale-proc-check]
# I want to see "sleep commands" only
filter = sleep
# running more than
history = 5 minutes
Now let's see who overstays (we should wait about 5 minutes before running our check ...):
$ sparrow check run system sleepyheads
# running cd /home/melezhik/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./ --ini /home/melezhik/sparrow/projects/system/checkpoints/sleepyheads/suite.ini ' ...
/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t ..
# filter: sleep
# history: 5 minutes
# 7 stale processes found
# pid: 3117 command: sleep 1000 delta: minutes: 5 seconds: 16
# pid: 3118 command: sleep 1000 delta: minutes: 5 seconds: 16
# pid: 3119 command: sleep 1000 delta: minutes: 5 seconds: 15
# pid: 3120 command: sleep 1000 delta: minutes: 5 seconds: 15
# pid: 3121 command: sleep 1000 delta: minutes: 5 seconds: 14
# pid: 3122 command: sleep 1000 delta: minutes: 5 seconds: 14
# pid: 3123 command: sleep 1000 delta: minutes: 5 seconds: 13
ok 1 - output match /count: (\d+)/
ok 2 - [b] output match 'start_proc_data'
ok 3 - [b] output match 'pid'
ok 4 - [b] output match 'command'
ok 5 - [b] output match 'time'
ok 6 - [b] output match 'delta'
ok 7 - [b] output match 'end_proc_data'
not ok 8 - zero stale processes found
1..8
# Failed test 'zero stale processes found'
# at /home/melezhik/sparrow/plugins/public/stale-proc-check/local/lib/perl5/Outthentic.pm line 130.
# Looks like you failed 1 test of 8.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests
Test Summary Report
-------------------
/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t (Wstat: 256 Tests: 8 Failed: 1)
Failed test: 8
Non-zero exit status: 1
Files=1, Tests=8, 1 wallclock secs ( 0.01 usr 0.01 sys + 0.05 cusr 0.01 csys = 0.08 CPU)
Result: FAIL
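If you want to follow up on a failed check, the PIDs can be pulled out of the report. The snippet below is a small sketch that assumes the "# pid: N command: ..." line format shown in the output above stays the same and that the report is written to standard output. The helper name stale_pids is hypothetical.

```shell
# Hypothetical helper: extract PIDs from the "# pid: N command: ..." report lines
# read on stdin, one PID per output line.
stale_pids() {
  sed -n 's/^# pid: \([0-9][0-9]*\) command: .*/\1/p'
}

# Possible usage (assumes the report goes to stdout):
#   sparrow check run system sleepyheads | stale_pids
```

It is probably wise to review the resulting list by hand before killing anything, since the filter may match more processes than you expect.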