For ages on 5.2/Alpha, I had a problem where syslogd would start chewing up the cpu spewing out messages like: Aug 5 05:35:30 canoe syslogd: select: Bad file descriptor Aug 5 05:36:01 canoe last message repeated 12432 times Aug 5 05:37:02 canoe last message repeated 50757 times Aug 5 05:38:03 canoe last message repeated 51313 times It was eventually fixed with a updated rpm 1.3.31-0.5. I upgraded to 6.0 last week, and it appears that the bug is back. I'm runnning sysklogd-1.3.31-6, and as you can see from the above chunk of my logs, I'm getting exactly the same problem again. Did the bug fix somehow not make it to this version of RedHat? Should I downgrade?
That shouldn't make a difference - the sysklogd source in 6.0 is identical to the 5.2 update. Are you using network logging?
Yes, I'm using network logging. I have a 5.2 Intel machine and a 6.0 Intel machine both logging to this one. But both of these machines logged to this machine before I upgraded the Alpha to 6.0 with no problems. Note that before the fix for 5.2, I used to see this symptom every week or so, then after the fix I didn't see it at all until I upgraded to 6.0 last week. I added the -r argument to syslog in /etc/rc.d/init.d/syslog.
Do you have anything unusual in your syslog.conf? (I can't seem to reproduce it here.)
My syslog.conf is pretty generic: # Log all kernel messages to the console. # Logging much else clutters up the screen. #kern.* /dev/console # Log anything (except mail) of level info or higher. # Don't log private authentication messages! *.info;mail.none;news.none;authpriv.none /var/log/messages # The authpriv file has restricted access. authpriv.* /var/log/secure # Log all the mail messages in one place. mail.* -/var/log/maillog # Everybody gets emergency messages, plus log them on another # machine. *.emerg * # Save mail and news errors of level err and higher in a # special file. uucp,news.crit /var/log/spooler # # INN # #news.=crit /var/log/news/news.crit #news.=err /var/log/news/news.err #news.notice /var/log/news/news.notice news.=crit /home/news/log/news.crit news.=err /home/news/log/news.err news.notice /home/news/log/news.notice I'm not suprised you can't reproduce it 2 hours after I opened the bug, since as I said earlier it often takes a week to show up. Next time it shows up, I'll try an lsof and strace on the syslogd program.
Hmm... does it show up right after anything in particular - for example, log rotation?
Yeah, it was pretty early in the morning, and pretty early in the log. I'd show you the exact part of the log, but the machine isn't responding to keystrokes in my window that's telnetted into it, and it's not accepting remote logins all of a sudden. I think it's crashing, or my company's firewall is shitting itself again. (Funnily enough, the web server seems to still be running, though)
Well, I found out why I couldn't log into my home machine before: syslogd/klogd was well and truely hosed. I tried to do a /etc/rc.d/init.d/syslog restart (as I had done a few hours before to clear the problem) and it hung while trying to shut down klogd. There is nothing in /var/log/messages after about 12:50. Anyway, as you guessed, the initial problem started right about when the logs rotated. Here is the first part of /var/log/messages: Aug 5 05:35:26 canoe syslogd 1.3-3: restart (remote reception). Aug 5 05:35:27 canoe syslogd 1.3-3: restart (remote reception). Aug 5 05:35:27 canoe syslogd: select: Bad file descriptor Aug 5 05:35:29 canoe last message repeated 1145 times Aug 5 05:35:29 canoe syslogd 1.3-3: restart (remote reception). Aug 5 05:35:29 canoe syslogd: select: Bad file descriptor Aug 5 05:35:30 canoe last message repeated 598 times Aug 5 05:35:30 canoe syslogd 1.3-3: restart (remote reception). Aug 5 05:35:30 canoe syslogd: select: Bad file descriptor Aug 5 05:36:01 canoe last message repeated 12432 times Aug 5 05:37:02 canoe last message repeated 50757 times and here is where I restarted the first time: Aug 5 10:11:52 canoe last message repeated 22379 times Aug 5 10:11:52 canoe exiting on signal 15 Aug 5 10:11:58 canoe syslogd 1.3-3: restart (remote reception). Aug 5 10:11:58 canoe syslog: syslogd startup succeeded Aug 5 10:11:55 canoe syslog: syslogd shutdown succeeded Aug 5 10:12:01 canoe syslog: klogd startup succeeded Aug 5 10:12:01 canoe kernel: klogd 1.3-3, log source = ksyslog started. Aug 5 10:12:01 canoe kernel: Inspecting /boot/System.map-2.2.10 Aug 5 10:12:04 canoe kernel: Loaded 6189 symbols from /boot/System.map-2.2.10. Aug 5 10:12:05 canoe kernel: Symbols match kernel version 2.2.10. Aug 5 10:12:05 canoe kernel: Error seeking in /dev/kmem Aug 5 10:12:05 canoe kernel: Symbol #appletalk, value 00058000 Aug 5 10:12:05 canoe kernel: Error adding kernel module table entry. Aug 5 10:12:12 cirrus identd[8242]: from: 132.235.8.44 ( oak.cats.ohiou.edu ) for: 64045, 25 and here is where it died: Aug 5 12:29:13 canoe identd[24171]: Connection from localhost Aug 5 12:29:13 canoe kernel: in.identd(24171): unaligned trap at 0000020000183c 4c: 000000011ffff574 2d 10 Aug 5 12:29:13 canoe identd[24171]: from: 127.0.0.1 ( localhost ) for: 2032, 25 Aug 5 12:29:13 canoe kernel: in.identd(24171): unaligned trap at 0000020000183c 4c: 000000011ffff574 2d 10 Aug 5 12:29:13 canoe last message repeated 2 times Aug 5 12:36:56 canoe PAM_pwdb[24386]: (su) session opened for user root by ptom blin(uid=406) Aug 5 14:18:27 canoe syslogd 1.3-3: restart (remote reception). Even though I had a root login, I couldn't even do a reboot, since it hung while trying to kill gpm. I suspect the problem was that it couldn't log what it was doing.
The damn thing happened again this morning. Again, it appears to have happened while rotating logs. There were three messages on my console that weren't in any logfiles. I didn't write down the exact message, but it was something like "syslogd: couldn't open file, already in use". Is there some sort of race condition in the way logrotate rotates one file, SIGHUPs syslogd, then rotates another file, SIGHUPs syslogd, then rotates another file, and SIGHUPs syslogd? This machine is a UDB, and it's running an innd news server, apache web server, NFS server, netatalk+asun appleshare server, and a few other things, so there are times when it gets a big overburdened.
Not that I know of, although I'll look again at the code. I've tried rotating all the logs even while running "find |logger" on a remote machine (albeit on a PC164) without any problems.
I did a run-parts /etc/cron.daily this morning (after rebooting) with no ill effects, but it appears that logrotate didn't rotate the syslogs because they are less than a full day old. I wonder if it might be a better test to do the logrotate while something is really hammering syslog so that it's slow responding to the HUP. Oh, I just remembered something which may or may not be cogent: /var/log is a symlink to a directory on sdb2 (second partition on second scsi disk), because / ran out of space the other day and I didn't have a spare partition I could just mount as /var/log. I did that on July 27th, and the morning of August 5 is the first time I saw the syslog file descriptor bug.
OK, I just played with multiple logrotate -f (force), while building glibc, the kernel, and linuxconf, running a find both locally and remotely, and piping them all through logger. It still didn't do anything strange, aside from causing the machine to run seriously low on disk space. What kernel are you running?
I've been running 2.2.10 for a week or so. I'm going to try 2.2.10ac12 today. But remember what I said earlier - this machine was running just fine for over a week with RedHat 6.0 before it started getting this problem. But then again, RedHat never believed it when I reported the EXACT same problem with RedHat 5.2, until BugTraq found and fixed a stack overrun.
Isn't that special? After rebooting this morning with kernel 2.2.10ac12, I was sitting here at work logged into this alpha with a tail -f /var/spool/messages and another window logged into it doing something useful (reading news) when it suddenly stopped responding. The last thing I see on my screen is Aug 6 12:28:30 cirrus identd[9938]: Returned: 61059 , 25 : NO-USER Aug 6 12:28:32 cirrus identd[9939]: from: 128.135.130.21 ( gsbpop.uchicago.edu ) for: 61061, 25 which freaks me out because 12:29 is when it stopped working yesterday. However, unlike yesterday I can't ping it or access the web server, so I don't know what is going on.
I'm sending a patch to bugzilla (check under the e-mail link on the page for this bug) which solves at least one problem with klogd. This may not solve your problem though.
I haven't seen the problem back since those two occurances on the 5th and the 6th. That's coincidentally when I started using 2.2.10ac12 instead of 2.2.10, but that could be coincidence. The other hangup I reported on the 6th was a random hang I get every few days as well, probably hardware related. (it's an Alpha UDB and it's showing its age.) Do you want me to get the source code and apply that patch, or can you send me an RPM? Oh, related question: Should I be running klogd with the -s switch or not? I was, because I'd been told to when I was having the problem with RedHat 5.2, but it didn't fix anything then.
That might not be a coincidence - AFAIR, Alan applied some alpha-specific patches between -10 plain and ac12. No, the -s option shouldn't be necessary. If you want the RPM, I've put copies at: ftp://people.redhat.com/notting/ and http://people.redhat.com/notting/ftp/ (It will be in Raw Hide soon, but that might be a day or two...)
Apologies for the delay. Has the problem repeated at all since the updated kernel? ------- Additional Comments From 09/20/99 20:13 ------- I never saw the problem again, but a few weeks ago I replaced that Alpha machine with an Intel machine.
Sorry, but this problem does not seems to be resolved. I've been able to reproduce it (without wanting to...) with RedHat 6.1 (i586), kernel-2.2.12-20 and glibc-2.1.2-11. I was using sysklogd-1.3.31-14.i386, updating to sysklogd-1.3.31-16 did not solve the problem. When syslogd starts, I get this in /var/log/messages: Aug 24 09:53:19 sanitized syslogd 1.3-3: restart. Aug 24 09:53:19 sanitized syslogd: select: Bad file descriptor Aug 24 09:53:50 sanitized last message repeated 416955 times Aug 24 09:54:51 sanitized last message repeated 860294 times Aug 24 09:55:52 sanitized last message repeated 869058 times Aug 24 09:56:53 sanitized last message repeated 870385 times Aug 24 09:57:54 sanitized last message repeated 869721 times Aug 24 09:58:55 sanitized last message repeated 869977 times Aug 24 09:59:56 sanitized last message repeated 869792 times Aug 24 10:00:57 sanitized last message repeated 862309 times Aug 24 10:01:58 sanitized last message repeated 851399 times Aug 24 10:02:59 sanitized last message repeated 862334 times Aug 24 10:04:00 sanitized last message repeated 859975 times Aug 24 10:05:01 sanitized last message repeated 859789 times Aug 24 10:06:02 sanitized last message repeated 863366 times Aug 24 10:07:03 sanitized last message repeated 868746 times Aug 24 10:08:04 sanitized last message repeated 868313 times Aug 24 10:09:05 sanitized last message repeated 868068 times Aug 24 10:10:06 sanitized last message repeated 867668 times Aug 24 10:11:07 sanitized last message repeated 870930 times Aug 24 10:12:08 sanitized last message repeated 870589 times Aug 24 10:13:09 sanitized last message repeated 870940 times Aug 24 10:14:10 sanitized last message repeated 870386 times Aug 24 10:15:11 sanitized last message repeated 871397 times Aug 24 10:16:12 sanitized last message repeated 877305 times Aug 24 10:17:13 sanitized last message repeated 876981 times Aug 24 10:18:14 sanitized last message repeated 877334 times and my CPU load peeks up at about 10 until syslogd or the kernel crashes. Could someone with the right privileges reopen this bug, set platform to All and version to 6.1 (or tell me how to solve this bug...). Thanks, GFK's
Okay, I started playing with the init.d script that starts syslogd and realised that the command to start syslog was: daemon syslogd -m 0 -a /home/dns/dev/log (For those wondering, it's bastille-linux that did that, and I don't even have bind installed... grrr). The folder /home/dns doesn't exits anymore, I deleted it months ago when I realised bastille created it. This was a mistake. 2 days ago syslogd started to do the mess described above. Changing the command starting syslog to: daemon syslogd -m 0 solved the problem. Maibe a fix for this could be that if someone adds a non existing socket, syslog doesn't fall into an infinite loop. Hope this helps clarify the problem, GFK's