Had the strangest problem with Nagios today. I noticed that I was not receiving email notifications when services went down. Nagios would log that it saw the problem and update the webpage correctly, but when it came to sending an email notification I got nothing. Its log said the emails went out, but watching for them on the Nagios machine showed nothing. The log looked like this:
Oct 27 16:02:02 nagios nagios: SERVICE ALERT: mail;SSH;CRITICAL;HARD;5;CRITICAL - Socket timeout after 10 seconds
Oct 27 16:02:02 nagios nagios: SERVICE NOTIFICATION: tech1;mail;SSH;CRITICAL;notify-by-email;CRITICAL - Socket timeout after 10 seconds
Postfix had not logged an email going out. Tcpdump showed no mail leaving the machine at the time Nagios supposedly sent it. I was confused, to say the least.
Nagios uses regular Unix programs (printf and mail) to send its email. I ran the command line Nagios uses to send mail by hand on the machine and it went out fine. I finally broke down and compiled Nagios with ultra (all) debug turned on. The webpage will not work with debug turned on, but the notifications and checks will. When it came time to send the mail, this is what I saw:
/tmp/RshkRO1F: Permission denied
Permissions on /tmp ??? WTF? Sure enough /tmp's permissions were screwed up. Showing:
drwxr-xr-x 5 100 users 4096 Oct 27 15:58 tmp
So setting them back to the correct perms (below) fixed the problem right up. Mail could not create a temp file to send out its email. Nagios does not seem to check if the mail went out correctly so you end up with nothing being logged anywhere.
chown root:root /tmp
chmod 1777 /tmp
I got nailed by Windows Server 2003 SP1's new Data Execution Prevention (DEP) today. DEP stops code from running out of data memory such as the stack. I was trying to install the Nagios NS Client program on a server with DEP turned on. When you tried to start the Nagios agent service you would get "System Error 1067 has occurred", which means the process was aborted; Windows says "The process terminated unexpectedly". To make an exception so that certain programs run without DEP, do the following in W2k3 SP1: Right click "My Computer", then "Properties". Click the "Advanced" tab, then click the "Settings" button under the "Performance" section. Click on the "Data Execution Prevention" tab and then click the radio button "Turn on DEP for all programs and services except those I select". Then click the "Add" button and add the exe you want to exempt from DEP. That problem was fun to hunt down.
I was setting up Nagios to monitor some systems and finally got to the printers. Well, someone made an HP plugin for Nagios to check a LaserJet's status (toner low, OK, etc). This was great because we have 11 HP lasers, but we also have 4 Xerox Phaser 7300s. There was of course no plugin to tell us if toner was low and such for this printer. Well, the cool thing is the printer runs a web server with a page where you can see a status of OK, LOW, or Empty for the toner and fusers. I needed to check if the page says "Empty". If it says "Empty" we need a "Critical" state; if not, we are good and give an "OK" state. Well, Nagios has a plugin that can check strings on a webpage. I thought, fantastic, I'll just check for "Empty" and if it is there have it give a critical.
Well, the thing is the check_http program can check for a string on a webpage, but when it finds it, it gives an "OK" response. This positive response "OK" is good when the string you are looking for is supposed to be there, but not good if you want a negative response "Critical" when the string is there. check_http did not have that function; it only gave positive responses to finding the string. I've heard that C is like Perl in some ways, so I figured I should be able to put a "!" in front of the string check in the source code to have it give a negative response if the string is on the page.
Well, lucky me, C and Perl share many of the same operators and they work the same. I made a new variable, added a new switch at the top for my negative response string check, slapped in another "if" statement with the negative check, and added the option to the --help output. Recompiled. And voilà! Works like a charm! Now we can check our Xerox printers and see if they are out of supplies.
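Boiled down, the negated check looks something like the sketch below. This is a standalone illustration, not the patch itself: the function name is made up, and the exit codes are the standard Nagios plugin values (0 for OK, 2 for CRITICAL).

```c
#include <assert.h>
#include <string.h>

/* Standard Nagios plugin exit codes. */
enum { STATE_OK = 0, STATE_WARNING = 1, STATE_CRITICAL = 2, STATE_UNKNOWN = 3 };

/* Hypothetical helper: returns STATE_CRITICAL if `forbidden` appears
 * anywhere in `page`, and STATE_OK otherwise. An empty `forbidden`
 * string means the check is not configured, so it is always OK. */
int state_for_forbidden_string(const char *page, const char *forbidden)
{
    assert(page != NULL && forbidden != NULL);

    if (forbidden[0] == '\0')
        return STATE_OK;

    /* strstr() returns NULL when the substring is absent, so the "!"
     * flips the usual "found means OK" logic. */
    if (!strstr(page, forbidden))
        return STATE_OK;

    return STATE_CRITICAL;
}
```

Note that strstr() is case sensitive, so a page that says "empty" in lowercase would not trip a check for "Empty".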
So if it were not for Open Source I would not be able to add my own needed features. I've always thought it was cool but it never hit home like this before.
The diff for the check_http is below. It was done on version 1.4.3 of Nagios Plugins.
85a86
> char string_noexpect[MAX_INPUT_BUFFER] = "";
172a174
>   {"nostring", required_argument, 0, 'g'},
207c209
<   c = getopt_long (argc, argv, "Vvh46t:c:w:A:k:H:P:T:I:a:e:p:s:R:r:u:f:C:nlLSm:M:N", longopts, &option);
---
>   c = getopt_long (argc, argv, "Vvh46t:c:w:A:k:H:P:T:I:a:e:g:p:s:R:r:u:f:C:nlLSm:M:N", longopts, &option);
327a330,333
>   case 'g': /* string or substring */
>     strncpy (string_noexpect, optarg, MAX_INPUT_BUFFER - 1);
>     string_noexpect[MAX_INPUT_BUFFER - 1] = 0;
>     break;
994a1001,1017
>
>   if (strlen (string_noexpect)) {
>     if (!strstr (page, string_noexpect)) {
>       printf (_("HTTP OK %s - %.3f second response time %s%s|%s %s\n"),
>               status_line, elapsed_time,
>               timestamp, (display_html ? "</A>" : ""),
>               perfd_time (elapsed_time), perfd_size (pagesize));
>       exit (STATE_OK);
>     }
>     else {
>       printf (_("CRITICAL - string found%s|%s %s\n"),
>               (display_html ? "</A>" : ""),
>               perfd_time (elapsed_time), perfd_size (pagesize));
>       exit (STATE_CRITICAL);
>     }
>   }
>
1259a1283,1284
>  -g, --nostring\n\
>    String not to expect in the content\n\
1344c1369
<   printf (" [-s string] [-l] [-r <regex> | -R <case-insensitive regex>] [-P string]\n");
---
>   printf (" [-s string] [-g string] [-l] [-r <regex> | -R <case-insensitive regex>] [-P string]\n");