Find "stale" processes on Linux / Unix servers

Sometimes processes on your server run unexpectedly long or do not die on time. This can cause many issues, mostly because your server resources start to run out.

Stale-proc-check is a sparrow plugin that shows you whether any "stale" processes exist on your server. It depends on the ps utility, so it will probably work on many Linux/Unix boxes.

Below is a short manual.

INSTALL

$ sparrow plg install stale-proc-check

USAGE

Once the plugin is installed, you need to define a configuration for it using a sparrow checkpoint container, which is just an abstraction for a configurable sparrow plugin.

You need to provide 2 parameters:

  • filter - a Perl regular expression that matches the processes you are interested in

  • history - a time period; matching processes younger than this are considered fresh enough

In other words, if any matching process older than the $history parameter is found, this is treated as a bad situation and the check fails.
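
To make the idea concrete, here is a minimal sketch of this kind of check in plain shell. It is not the plugin's code: find_stale is a hypothetical helper, the filter is an awk extended regular expression rather than a Perl one, and the threshold is given in seconds. It reads lines of "PID ELAPSED_SECONDS COMMAND" on stdin and prints only the matching processes that are older than the threshold.

```shell
# Hypothetical helper, NOT part of stale-proc-check.
# stdin:  lines of "PID ELAPSED_SECONDS COMMAND..."
# stdout: the lines whose command matches the filter and whose age
#         is greater than the threshold.
find_stale() {
  filter=$1     # awk extended regular expression (assumption: not Perl syntax)
  max_age=$2    # threshold in seconds
  awk -v re="$filter" -v max="$max_age" '
    {
      pid = $1; age = $2
      cmd = $0
      # Strip the two leading numeric columns to get the command line.
      sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+/, "", cmd)
      if (cmd ~ re && age + 0 > max + 0) print pid, age, cmd
    }'
}

# Assumption: a Linux procps-ng ps, which supports the etimes column
# (elapsed time in seconds). A live listing could then be fed in like this:
#   ps -eo pid=,etimes=,args= | find_stale 'knife[ ]+ssh' 1800
```

Anything the function prints would count as a "stale" process in the sense described above; empty output means everything matching the filter is fresh enough.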

Ok, now we are ready to do configuration:

$ sparrow project create system

$ sparrow check add system stale-ssh-sessions

$ sparrow check set system stale-ssh-sessions stale-proc-check

$ export EDITOR=nano && sparrow check ini system stale-ssh-sessions

[stale-proc-check]
# let's find all knife ssh processes running for more than half an hour
filter = knife\s+ssh
history = 30  minutes

In this example I am looking for all knife ssh processes running for more than half an hour. On our production system it is typical that, for some reason, knife ssh commands do not die even though the parent process has terminated. Well, we also have bugs :)
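
If you want to reason about a history value such as "30 minutes" in seconds, a small conversion helper can be sketched like this. The function name and the list of accepted units are my assumptions for illustration; the post does not say which units the plugin itself accepts.

```shell
# Hypothetical helper: converts "<number> <unit>" into seconds.
# The supported units are an assumption, not the plugin's documented list.
history_to_seconds() {
  # Word-split the argument so extra spaces ("30  minutes") are tolerated.
  set -- $1
  n=$1; unit=$2
  case $unit in
    second|seconds) echo "$n" ;;
    minute|minutes) echo $((n * 60)) ;;
    hour|hours)     echo $((n * 3600)) ;;
    day|days)       echo $((n * 86400)) ;;
    *) echo "unknown unit: $unit" >&2; return 1 ;;
  esac
}

history_to_seconds '30  minutes'   # prints 1800
```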

Now let's run a check:

$ sparrow check run system stale-ssh-sessions

# running cd /root/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./  --ini /root/sparrow/projects/system/checkpoints/stale-ssh-sessions/suite.ini ' ...

/tmp/.outthentic/25883/root/sparrow/plugins/public/stale-proc-check/story.t ..
# filter: knife\s+ssh
# history: 30 minutes
# 0 stale processes found
ok 1 - output match /count: (\d+)/
ok 2 - zero stale processes found
1..2
ok
All tests successful.
Files=1, Tests=2,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.09 cusr  0.00 csys =  0.11 CPU)
Result: PASS
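
The last line of the report carries the verdict. If you want to act on it from a script, one option is to read that line rather than rely on the exit status of the sparrow command, which this post does not show. A minimal sketch, where check_passed is a hypothetical helper name:

```shell
# Hypothetical helper: reads a report on stdin and succeeds only if
# the last "Result:" line says PASS.
check_passed() {
  awk '/^Result: / { verdict = $2 }
       END { if (verdict == "PASS") exit 0; exit 1 }'
}

# Possible usage with the check from this post:
#   sparrow check run system stale-ssh-sessions | check_passed || echo "stale processes found"
```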

Hurrah, no stale processes here ... But at the end of this post let me show a negative case as well. Let's start a few sleep commands and check that they are still running. Indeed they should be! :)

$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &


$ sparrow check add system sleepyheads

$ sparrow check set system sleepyheads stale-proc-check

$ export EDITOR=nano && sparrow check ini system sleepyheads

[stale-proc-check]
# I want to see "sleep commands" only
filter = sleep

# running more than
history = 5  minutes

Now let's see who overstays (we should wait about 5 minutes before running our check ...):

$ sparrow check run system sleepyheads


# running cd /home/melezhik/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./  --ini /home/melezhik/sparrow/projects/system/checkpoints/sleepyheads/suite.ini ' ...

/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t .. 
# filter: sleep
# history: 5  minutes
# 7 stale processes found
# pid: 3117 command: sleep 1000                        delta: minutes: 5 seconds: 16 
# pid: 3118 command: sleep 1000                        delta: minutes: 5 seconds: 16 
# pid: 3119 command: sleep 1000                        delta: minutes: 5 seconds: 15 
# pid: 3120 command: sleep 1000                        delta: minutes: 5 seconds: 15 
# pid: 3121 command: sleep 1000                        delta: minutes: 5 seconds: 14 
# pid: 3122 command: sleep 1000                        delta: minutes: 5 seconds: 14 
# pid: 3123 command: sleep 1000                        delta: minutes: 5 seconds: 13 
ok 1 - output match /count: (\d+)/
ok 2 - [b] output match 'start_proc_data'
ok 3 - [b] output match 'pid'
ok 4 - [b] output match 'command'
ok 5 - [b] output match 'time'
ok 6 - [b] output match 'delta'
ok 7 - [b] output match 'end_proc_data'
not ok 8 - zero stale processes found
1..8

#   Failed test 'zero stale processes found'
#   at /home/melezhik/sparrow/plugins/public/stale-proc-check/local/lib/perl5/Outthentic.pm line 130.
# Looks like you failed 1 test of 8.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests 

Test Summary Report
-------------------
/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t (Wstat: 256 Tests: 8 Failed: 1)
  Failed test:  8
  Non-zero exit status: 1
Files=1, Tests=8,  1 wallclock secs ( 0.01 usr  0.01 sys +  0.05 cusr  0.01 csys =  0.08 CPU)
Result: FAIL
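
The "# pid: ..." lines in a failed report are regular enough to be picked apart with awk. Here is a small sketch that pulls out just the process ids so you can review them; stale_pids is a hypothetical helper, and it assumes the report keeps the line format shown above.

```shell
# Hypothetical helper: prints the pid from every "# pid: N command: ..." line.
# Assumes the report format shown in this post.
stale_pids() {
  awk '$1 == "#" && $2 == "pid:" { print $3 }'
}

# Possible usage with the check from this post:
#   sparrow check run system sleepyheads | stale_pids
```

I would look at the listed processes by hand before doing anything drastic with them.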

About melezhik

Dev & Devops --- Then I beheld all the work of God, that a man cannot find out the work that is done under the sun: because though a man labour to seek it out, yet he shall not find it; yea further; though a wise man think to know it, yet shall he not be able to find it. (Ecclesiastes 8:17)