April 2016 Archives

Outthentic expansion

Hi!

This is a quick post on the latest Outthentic changes, to be released in the near future. Without going into technical details, I am glad to say that:

  • Outthentic is turning into a "multipurpose scenarios framework", meaning that not only test or report scripts but ANY scripts or scenarios can be written with this tool.

  • Outthentic is becoming a language-agnostic tool ( but it is still written in Perl ! ). Now one may use Perl, Bash or Ruby to extend the basic DSL; see a lot of examples here and the heavily updated documentation here

All these features are still under active development, but one may play with the development version by installing it from the GitHub source code.

Regards

--- Alexey Melezhik

Monitoring bad ssh logins with sparrow and logdog

Logdog is yet another sparrow plugin to analyze various log files.

One may use logdog for various tasks, like finding errors in nginx/apache logs or inspecting suspicious entries in your sshd log file.

One killer feature of logdog is that it parses log lines for a given period of time: one can ask for only the entries from the last hour, the last 2 days, or even the last 15 minutes. This is very handy, as it is often exactly what you need: to identify what has happened recently without re-examining all the events that happened long before.

Let's see how one can monitor BAD ssh logins using logdog. The rest of this post will read like formal documentation, but I hope you won't be frustrated! :)

Installing logdog

First of all we need sparrow and some minimal prerequisites:

$ yum install curl
$ cpanm Sparrow
$ sparrow index update
$ sparrow plg install logdog

Now comes the configuration part.

Configure monitoring suite

 # let's create a project - it is just a container for sparrow monitoring suites
 $ sparrow project create system
 # a sparrow checkpoint is a configurable sparrow plugin 
 $ sparrow check add system ssh-bad-logins
 # let's bind the ssh-bad-logins checkpoint to the logdog plugin
 $ sparrow check set system ssh-bad-logins logdog
 # and provide the configuration:
 $ export EDITOR=nano
 $ sparrow check ini system ssh-bad-logins

Configuration parameters will be:

[logdog]
# this is where sshd keeps its logs
file = /var/log/auth.log

# we are interested only in the entries related to `bad login' failures
filter = Failed password

# on my server timestamps come in a format
# like Apr  8 11:35:23
# this regexp pattern will be used to identify entry times:
time_pattern = (\S+)\s+(\d+)\s+(\S+)\s

# this is required to convert the times found ( given by time_pattern )
# into Time objects and make proper time calculations
time_format = %b %e %T

# this timezone is used on my server
# the list of acceptable timezones
# could be found here
# https://metacpan.org/pod/DateTime::TimeZone::Catalog
timezone = Europe/Moscow

# key_field is a pattern to group found entries
# as on my server they look like:
# Failed password for $LOGIN from $IP_ADDRESS port 18367 ssh2
# I want to group them by user login and IP address
key_field = password for\s+(.*)\s+from\s+(\S+)

# density works in addition to key_field:
# it sets the minimal number of entries in a group
# if you need to take into account only frequent events
# you may increase density
# e.g. density = 10
# means >= 10 bad login attempts for a given user from a given IP_ADDRESS
# for the given period of time
density = 1

# I only need the last 10 minutes of activity
history = 10 minutes

# setting check_mode to zero
# means that if more than zero entries are found
# it will be treated as a check failure
# yeah, in other words probably someone is trying to break into your server
# over ssh
check_mode = zero
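To make the interplay of these settings concrete, here is a rough Python sketch of the matching logic described by the config above (my own reconstruction, not logdog's actual code; the function name and sample log are mine): filter lines, extract and parse timestamps, keep only entries inside the history window, group by the key_field captures, and drop groups below density.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Sample auth.log-style lines (invented for illustration)
LOG = """\
Apr 12 16:16:02 host sshd[29099]: Failed password for invalid user foo from 127.0.0.1 port 55267 ssh2
Apr 12 16:16:05 host sshd[29099]: Failed password for invalid user foo from 127.0.0.1 port 55267 ssh2
Apr 12 16:16:38 host sshd[29111]: Failed password for invalid user bar from 127.0.0.1 port 55271 ssh2
Apr 11 10:00:00 host sshd[11111]: Failed password for invalid user old from 127.0.0.1 port 11111 ssh2
Apr 12 16:16:40 host sshd[29112]: Accepted password for root from 10.0.0.1 port 22 ssh2
"""

def parse_groups(log, now, history=timedelta(minutes=10), density=1):
    groups = defaultdict(list)
    for line in log.splitlines():
        if 'Failed password' not in line:                  # filter
            continue
        m = re.match(r'(\S+)\s+(\d+)\s+(\S+)\s', line)     # time_pattern
        if not m:
            continue
        # time_format (here via strptime; the year defaults to 1900)
        ts = datetime.strptime(' '.join(m.groups()), '%b %d %H:%M:%S')
        if now - ts > history:                             # history window
            continue
        k = re.search(r'password for\s+(.*)\s+from\s+(\S+)', line)  # key_field
        groups[k.groups()].append(line)
    # density: keep only groups with enough entries
    return {k: v for k, v in groups.items() if len(v) >= density}

now = datetime(1900, 4, 12, 16, 20)   # pretend "now"; year matches strptime default
groups = parse_groups(LOG, now)
for key, entries in groups.items():
    print(key, len(entries))
```

Note that the entry from Apr 11 falls outside the 10-minute window and the `Accepted password` line fails the filter, so only the foo and bar groups survive; with `density = 2` the single bar attempt would be dropped as well.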

Now let's give it a try. But before doing that we need to "generate" some unsuccessful logins:

$ ssh foo@127.0.0.1 
foo@127.0.0.1's password: 
Permission denied (publickey,password).

And so on ... Now let's see if our monitoring suite detects our activity:

$ sparrow check run system ssh-bad-logins  


# cd /root/sparrow/plugins/public/logdog && carton exec 'strun --root ./  --ini /root/sparrow/projects/system/checkpoints/ssh-bad-logins/suite.ini '

/tmp/.outthentic/29141/root/sparrow/plugins/public/logdog/story.t .. 
# history: 10 minutes
# filter: Failed password
# density: 1
# group invalid user foo 127.0.0.1 count: 9
# Apr 12 16:16:02 melezhik-pc sshd[29099]: Failed password for invalid user foo from 127.0.0.1 port 55267 ssh2
# Apr 12 16:16:05 melezhik-pc sshd[29099]: Failed password for invalid user foo from 127.0.0.1 port 55267 ssh2
# Apr 12 16:16:08 melezhik-pc sshd[29099]: Failed password for invalid user foo from 127.0.0.1 port 55267 ssh2
# Apr 12 16:16:12 melezhik-pc sshd[29103]: Failed password for invalid user foo from 127.0.0.1 port 55268 ssh2
# Apr 12 16:16:16 melezhik-pc sshd[29103]: Failed password for invalid user foo from 127.0.0.1 port 55268 ssh2
# Apr 12 16:16:19 melezhik-pc sshd[29103]: Failed password for invalid user foo from 127.0.0.1 port 55268 ssh2
# Apr 12 16:16:22 melezhik-pc sshd[29107]: Failed password for invalid user foo from 127.0.0.1 port 55270 ssh2
# Apr 12 16:16:26 melezhik-pc sshd[29107]: Failed password for invalid user foo from 127.0.0.1 port 55270 ssh2
# Apr 12 16:16:29 melezhik-pc sshd[29107]: Failed password for invalid user foo from 127.0.0.1 port 55270 ssh2
# group invalid user bar 127.0.0.1 count: 3
# Apr 12 16:16:38 melezhik-pc sshd[29111]: Failed password for invalid user bar from 127.0.0.1 port 55271 ssh2
# Apr 12 16:16:41 melezhik-pc sshd[29111]: Failed password for invalid user bar from 127.0.0.1 port 55271 ssh2
# Apr 12 16:16:44 melezhik-pc sshd[29111]: Failed password for invalid user bar from 127.0.0.1 port 55271 ssh2
ok 1 - output match /lines count: (\d+)/
ok 2 - output match /Failed password/
not ok 3 - zero groups found
1..3

#   Failed test 'zero groups found'
#   at /root/sparrow/plugins/public/logdog/local/lib/perl5/Outthentic.pm line 144.
# Looks like you failed 1 test of 3.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests 

Test Summary Report
-------------------
/tmp/.outthentic/29141/root/sparrow/plugins/public/logdog/story.t (Wstat: 256 Tests: 3 Failed: 1)
Failed test:  3
Non-zero exit status: 1
Files=1, Tests=3,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.25 cusr  0.00 csys =  0.27 CPU)
Result: FAIL

Now, to get things completely automated, let's define a cron job starting every 10 minutes:

$ crontab -l

*/10  * * * * sparrow check run  system ssh-bad-logins   --cron

Conclusion

That is it. As we can see, logdog is quite a flexible tool. As I already said, it can be used for various tasks; eventually I will probably have to say something about monitoring nginx errors or the like :)

find "stale" processes on linux / unix servers

Sometimes, for some reason, processes on your server run unexpectedly long or don't die on time; this may cause many issues, basically because your server's resources start to run out.

Stale-proc-check is a sparrow plugin that shows you if any "stale" processes exist on your server. It depends on the ps utility, so it will probably work on most linux/unix boxes ...

Below is a short manual.

INSTALL

$ sparrow plg install stale-proc-check

USAGE

Once the plugin is installed you need to define a configuration for it using a sparrow checkpoint container, which is just an abstraction for a configurable sparrow plugin.

You need to provide 2 parameters:

  • filter - perl regular expression to match a desired process

  • history - a time period used to decide whether the processes found are fresh enough

In other words, if any processes older than the $history parameter are found, this is treated as a bad situation and the check fails.
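A rough Python sketch of how such a check can be implemented on top of ps output (my reconstruction, not the plugin's actual code; it assumes a ps supporting the etimes elapsed-seconds column, as modern procps does):

```python
import re

# Captured `ps -eo pid,etimes,args`-style output (invented for illustration)
PS_OUTPUT = """\
  PID ELAPSED COMMAND
 3117    2105 knife ssh 'role:web' uptime
 4210      42 knife ssh 'role:db' uptime
 5001    9999 /usr/sbin/sshd -D
"""

def stale(ps_output, pattern, history_seconds):
    """Return (pid, command) for processes matching the filter regex
    that have been running longer than the history threshold."""
    found = []
    for line in ps_output.splitlines()[1:]:        # skip the header row
        pid, elapsed, cmd = line.split(None, 2)
        if re.search(pattern, cmd) and int(elapsed) > history_seconds:
            found.append((int(pid), cmd))
    return found

# knife ssh processes running longer than 30 minutes (1800 seconds)
print(stale(PS_OUTPUT, r'knife\s+ssh', 30 * 60))
```

Here only pid 3117 is reported: pid 4210 matches the filter but is too young, and sshd does not match the filter at all.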

Ok, now we are ready to do the configuration:

$ sparrow project create system

$ sparrow check add system stale-ssh-sessions

$ sparrow check set system stale-ssh-sessions stale-proc-check

$ export EDITOR=nano && sparrow check ini system stale-ssh-sessions

[stale-proc-check]
# let's find all knife ssh processes running more than half an hour
filter = knife\s+ssh
history = 30  minutes

In this example I will be looking for all knife ssh processes running more than half an hour; on our production system it is typical that for some reason knife ssh commands do not die even though the parent process is terminated. Well, we have bugs too :)

Now let's run a check:

$ sparrow check run system stale-ssh-sessions

# running cd /root/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./  --ini /root/sparrow/projects/system/checkpoints/stale-ssh-sessions/suite.ini ' ...

/tmp/.outthentic/25883/root/sparrow/plugins/public/stale-proc-check/story.t ..
# filter: knife\s+ssh
# history: 30 minutes
# 0 stale processes found
ok 1 - output match /count: (\d+)/
ok 2 - zero stale processes found
1..2
ok
All tests successful.
Files=1, Tests=2,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.09 cusr  0.00 csys =  0.11 CPU)
Result: PASS

Hurrah, no stale processes here ... But at the end of this post let me show a negative case as well. Let's start a few sleep commands and check that they are still running; indeed they should be! :)

$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &
$ sleep 1000 &


$ sparrow check add system sleepyheads

$ sparrow check set system sleepyheads stale-proc-check

$ export EDITOR=nano && sparrow check ini system sleepyheads

[stale-proc-check]
# I want to see "sleep commands" only
filter = sleep

# running longer than
history = 5  minutes

Now let's see who overstays ( we should wait about 5 minutes before running our check ... ) :

 $ sparrow check run system sleepyheads 


# running cd /home/melezhik/sparrow/plugins/public/stale-proc-check && carton exec 'strun --root ./  --ini /home/melezhik/sparrow/projects/system/checkpoints/sleepyheads/suite.ini ' ...

/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t .. 
# filter: sleep
# history: 5  minutes
# 7 stale processes found
# pid: 3117 command: sleep 1000                        delta: minutes: 5 seconds: 16 
# pid: 3118 command: sleep 1000                        delta: minutes: 5 seconds: 16 
# pid: 3119 command: sleep 1000                        delta: minutes: 5 seconds: 15 
# pid: 3120 command: sleep 1000                        delta: minutes: 5 seconds: 15 
# pid: 3121 command: sleep 1000                        delta: minutes: 5 seconds: 14 
# pid: 3122 command: sleep 1000                        delta: minutes: 5 seconds: 14 
# pid: 3123 command: sleep 1000                        delta: minutes: 5 seconds: 13 
ok 1 - output match /count: (\d+)/
ok 2 - [b] output match 'start_proc_data'
ok 3 - [b] output match 'pid'
ok 4 - [b] output match 'command'
ok 5 - [b] output match 'time'
ok 6 - [b] output match 'delta'
ok 7 - [b] output match 'end_proc_data'
not ok 8 - zero stale processes found
1..8

#   Failed test 'zero stale processes found'
#   at /home/melezhik/sparrow/plugins/public/stale-proc-check/local/lib/perl5/Outthentic.pm line 130.
# Looks like you failed 1 test of 8.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/8 subtests 

Test Summary Report
-------------------
/tmp/.outthentic/5584/home/melezhik/sparrow/plugins/public/stale-proc-check/story.t (Wstat: 256 Tests: 8 Failed: 1)
  Failed test:  8
  Non-zero exit status: 1
Files=1, Tests=8,  1 wallclock secs ( 0.01 usr  0.01 sys +  0.05 cusr  0.01 csys =  0.08 CPU)
Result: FAIL

monitoring failed minion tasks with minion-check

Minion-check is a sparrow plugin to monitor failed minion tasks.

One can easily verify whether any minion jobs have failed within a certain period of time, which is important to understand if there are any failures in your long-running tasks executed by minion.

The installation:

$ sparrow plg install minion-check 
$ sparrow project create myhost
$ sparrow check add myhost minion
$ sparrow check set myhost minion minion-check

And the configuration:

Here you need to provide a bash command to run your minion command and, optionally, a period of time to look up failed minion tasks:

$ export EDITOR=nano
$ sparrow check ini myhost minion


# bash script to run the minion command
command = cd /foo/bar/app/path && carton exec ./app.pl minion

# check failed tasks for the last 5 minutes, 10 hours, 2 days , etc ...
history = 10 minutes
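The history parameter boils down to a timestamp-window test. Here is a hypothetical Python sketch of that check (the function name and sample line are mine, not the plugin's code): parse a failed job's UTC "(finished)" timestamp and test whether it falls within the window.

```python
import re
from datetime import datetime, timedelta

# A "(finished)" line as minion prints it for a job (invented sample)
JOB_OUTPUT = "2016-04-01T19:08:53Z (finished)"

def failed_recently(line, now, history=timedelta(minutes=10)):
    """True if the finished timestamp lies within the history window."""
    m = re.match(r'(\d{4}-\d{2}-\d{2})T(\S+)Z', line)
    finished = datetime.strptime(' '.join(m.groups()), '%Y-%m-%d %H:%M:%S')
    return now - finished <= history

now = datetime(2016, 4, 1, 19, 15)        # pretend "now", about 6 minutes later
print(failed_recently(now=now, line=JOB_OUTPUT))   # within 10 minutes: counts as a failure
```

With a "now" of 19:30 the same failure would fall outside the 10-minute window and be ignored.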

That's ok. Now let's run the plugin on the host where your minion workers run:

$ sparrow check run  myhost minion 
# running cd /home/vagrant/sparrow/plugins/public/minion-check && carton exec 'strun --root ./  --ini /home/vagrant/sparrow/projects/myhost/checkpoints/minion/suite.ini ' ...

/tmp/.outthentic/18157/home/vagrant/sparrow/plugins/public/minion-check/failed-tasks/story.t ..
ok 1 - stdout is already set
ok 2 - stdout saved to /tmp/.outthentic/18157/3r2uXbq_kM
ok 3 - output match /Q=(1|0)/
ok 4 - output match /(\d+)\s+failed/
ok 5 - stdout is already set
ok 6 - stdout saved to /tmp/.outthentic/18157/HpQ2V5mxm4
# foo_task (default, failed, p0, r0)
# []
# "100 at app.pl line 11, <DATA> line 742.\n"
# 2016-04-01T19:06:24Z (created)
# 2016-04-01T19:08:52Z (started)
# 2016-04-01T19:08:53Z (finished)
ok 7 - output match /(.*)/
ok 8 - output match /finished/
ok 9 - '2016-04-01T19:08:53Z (finished)' match /(\d\d\d\d-\d\d-\d\d)T(\S+)Z.*/
not ok 10 - 0 failed jobs found for period 10 minutes
1..10

#   Failed test '0 failed jobs found for period 10 minutes'
#   at /home/vagrant/sparrow/plugins/public/minion-check/local/lib/perl5/Outthentic.pm line 130.
# Looks like you failed 1 test of 10.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/10 subtests

Test Summary Report
-------------------
/tmp/.outthentic/18157/home/vagrant/sparrow/plugins/public/minion-check/failed-tasks/story.t (Wstat: 256 Tests: 10 Failed: 1)
  Failed test:  10
  Non-zero exit status: 1
Files=1, Tests=10,  1 wallclock secs ( 0.03 usr  0.00 sys +  0.60 cusr  0.05 csys =  0.68 CPU)
Result: FAIL

Running by cron

Thus, one may monitor minion task failures with cron every 10 minutes:

$ crontab -l

*/10 * * * * sparrow check run  myhost minion --cron

Running sparrow check with the --cron option means giving output only in case of test failures ...
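That "output only on failure" behavior can be sketched as a tiny wrapper (an illustration of the semantics only, not sparrow's actual implementation):

```python
import subprocess

def cron_run(cmd):
    """Run a check command; print its output only when it fails,
    so cron stays silent on success (the --cron behavior described above)."""
    p = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if p.returncode != 0:          # only failures are reported
        print(p.stdout, end='')
    return p.returncode

cron_run("echo 'all good'; exit 0")       # silent
cron_run("echo 'check FAILED'; exit 1")   # prints the report
```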

About melezhik

Dev & Devops --- Then I beheld all the work of God, that a man cannot find out the work that is done under the sun: because though a man labour to seek it out, yet he shall not find it; yea further; though a wise man think to know it, yet shall he not be able to find it. (Ecclesiastes 8:17)