4371 – syslogd file descriptor bug is back

Bug 4371 - syslogd file descriptor bug is back

Summary: syslogd file descriptor bug is back

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	sysklogd
Sub Component:
Version:	6.0
Hardware:	alpha
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Bill Nottingham
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	1999-08-05 14:38 UTC by ptomblin
Modified:	2014-03-17 02:09 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	1999-09-25 02:19:14 UTC
Embargoed:

Attachments	(Terms of Use)

Description ptomblin 1999-08-05 14:38:28 UTC

For ages on 5.2/Alpha, I had a problem where syslogd would
start chewing up the cpu spewing out messages like:
Aug  5 05:35:30 canoe syslogd: select: Bad file descriptor
Aug  5 05:36:01 canoe last message repeated 12432 times
Aug  5 05:37:02 canoe last message repeated 50757 times
Aug  5 05:38:03 canoe last message repeated 51313 times
It was eventually fixed with a updated rpm 1.3.31-0.5.

I upgraded to 6.0 last week, and it appears that the bug is
back.  I'm runnning sysklogd-1.3.31-6, and as you can see
from the above chunk of my logs, I'm getting exactly the
same problem again.  Did the bug fix somehow not make it to
this version of RedHat?  Should I downgrade?

Comment 1 Bill Nottingham 1999-08-05 14:50:59 UTC

That shouldn't make a difference - the sysklogd source in 6.0
is identical to the 5.2 update.

Are you using network logging?

Comment 2 ptomblin 1999-08-05 15:03:59 UTC

Yes, I'm using network logging.  I have a 5.2 Intel machine and a 6.0
Intel machine both logging to this one.  But both of these machines
logged to this machine before I upgraded the Alpha to 6.0 with no
problems.

Note that before the fix for 5.2, I used to see this symptom every
week or so, then after the fix I didn't see it at all until I upgraded
to 6.0 last week.

I added the -r argument to syslog in /etc/rc.d/init.d/syslog.

Comment 3 Bill Nottingham 1999-08-05 15:30:59 UTC

Do you have anything unusual in your syslog.conf?
(I can't seem to reproduce it here.)

Comment 4 ptomblin 1999-08-05 16:41:59 UTC

My syslog.conf is pretty generic:
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.*                                                 /dev/console

# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;news.none;authpriv.none
/var/log/messages

# The authpriv file has restricted access.
authpriv.*
/var/log/secure

# Log all the mail messages in one place.
mail.*
-/var/log/maillog

# Everybody gets emergency messages, plus log them on another
# machine.
*.emerg                                                 *

# Save mail and news errors of level err and higher in a
# special file.
uucp,news.crit
/var/log/spooler

#
# INN
#
#news.=crit
/var/log/news/news.crit
#news.=err
/var/log/news/news.err
#news.notice
/var/log/news/news.notice
news.=crit
/home/news/log/news.crit
news.=err
/home/news/log/news.err
news.notice
/home/news/log/news.notice

I'm not suprised you can't reproduce it 2 hours after I opened the
bug, since as I said earlier it often takes a week to show up.

Next time it shows up, I'll try an lsof and strace on the syslogd
program.

Comment 5 Bill Nottingham 1999-08-05 16:45:59 UTC

Hmm... does it show up right after anything in particular -
for example, log rotation?

Comment 6 ptomblin 1999-08-05 16:54:59 UTC

Yeah, it was pretty early in the morning, and pretty early in the log.
I'd show you the exact part of the log, but the machine isn't
responding to keystrokes in my window that's telnetted into it, and
it's not accepting remote logins all of a sudden.  I think it's
crashing, or my company's firewall is shitting itself again.  (Funnily
enough, the web server seems to still be running, though)

Comment 7 ptomblin 1999-08-05 22:34:59 UTC

Well, I found out why I couldn't log into my home machine before:
syslogd/klogd was well and truely hosed.  I tried to do a
/etc/rc.d/init.d/syslog restart (as I had done a few hours before to
clear the problem) and it hung while trying to shut down klogd.  There
is nothing in /var/log/messages after about 12:50.

Anyway, as you guessed, the initial problem started right about when
the logs rotated.  Here is the first part of /var/log/messages:
Aug  5 05:35:26 canoe syslogd 1.3-3: restart (remote reception).
Aug  5 05:35:27 canoe syslogd 1.3-3: restart (remote reception).
Aug  5 05:35:27 canoe syslogd: select: Bad file descriptor
Aug  5 05:35:29 canoe last message repeated 1145 times
Aug  5 05:35:29 canoe syslogd 1.3-3: restart (remote reception).
Aug  5 05:35:29 canoe syslogd: select: Bad file descriptor
Aug  5 05:35:30 canoe last message repeated 598 times
Aug  5 05:35:30 canoe syslogd 1.3-3: restart (remote reception).
Aug  5 05:35:30 canoe syslogd: select: Bad file descriptor
Aug  5 05:36:01 canoe last message repeated 12432 times
Aug  5 05:37:02 canoe last message repeated 50757 times

and here is where I restarted the first time:

Aug  5 10:11:52 canoe last message repeated 22379 times
Aug  5 10:11:52 canoe exiting on signal 15
Aug  5 10:11:58 canoe syslogd 1.3-3: restart (remote reception).
Aug  5 10:11:58 canoe syslog: syslogd startup succeeded
Aug  5 10:11:55 canoe syslog: syslogd shutdown succeeded
Aug  5 10:12:01 canoe syslog: klogd startup succeeded
Aug  5 10:12:01 canoe kernel: klogd 1.3-3, log source = ksyslog
started.
Aug  5 10:12:01 canoe kernel: Inspecting /boot/System.map-2.2.10
Aug  5 10:12:04 canoe kernel: Loaded 6189 symbols from
/boot/System.map-2.2.10.
Aug  5 10:12:05 canoe kernel: Symbols match kernel version 2.2.10.
Aug  5 10:12:05 canoe kernel: Error seeking in /dev/kmem
Aug  5 10:12:05 canoe kernel: Symbol #appletalk, value 00058000
Aug  5 10:12:05 canoe kernel: Error adding kernel module table entry.
Aug  5 10:12:12 cirrus identd[8242]: from: 132.235.8.44 (
oak.cats.ohiou.edu ) for: 64045, 25

and here is where it died:

Aug  5 12:29:13 canoe identd[24171]: Connection from localhost
Aug  5 12:29:13 canoe kernel: in.identd(24171): unaligned trap at
0000020000183c
4c: 000000011ffff574 2d 10
Aug  5 12:29:13 canoe identd[24171]: from: 127.0.0.1 ( localhost )
for: 2032, 25
Aug  5 12:29:13 canoe kernel: in.identd(24171): unaligned trap at
0000020000183c
4c: 000000011ffff574 2d 10
Aug  5 12:29:13 canoe last message repeated 2 times
Aug  5 12:36:56 canoe PAM_pwdb[24386]: (su) session opened for user
root by ptom
blin(uid=406)
Aug  5 14:18:27 canoe syslogd 1.3-3: restart (remote reception).

Even though I had a root login, I couldn't even do a reboot, since it
hung while trying to kill gpm.  I suspect the problem was that it
couldn't log what it was doing.

Comment 8 ptomblin 1999-08-06 13:37:59 UTC

The damn thing happened again this morning.  Again, it appears to have
happened while rotating logs.  There were three messages on my console
that weren't in any logfiles.  I didn't write down the exact message,
but it was something like "syslogd: couldn't open file, already in
use".  Is there some sort of race condition in the way logrotate
rotates one file, SIGHUPs syslogd, then rotates another file, SIGHUPs
syslogd, then rotates another file, and SIGHUPs syslogd?

This machine is a UDB, and it's running an innd news server, apache
web server, NFS server, netatalk+asun appleshare server, and a few
other things, so there are times when it gets a big overburdened.

Comment 9 Bill Nottingham 1999-08-06 14:09:59 UTC

Not that I know of, although I'll look again at the code.
I've tried rotating all the logs even while running
"find |logger" on a remote machine (albeit on a PC164) without
any problems.

Comment 10 ptomblin 1999-08-06 14:30:59 UTC

I did a run-parts /etc/cron.daily this morning (after rebooting) with
no ill effects, but it appears that logrotate didn't rotate the
syslogs because they are less than a full day old.

I wonder if it might be a better test to do the logrotate while
something is really hammering syslog so that it's slow responding to
the HUP.

Oh, I just remembered something which may or may not be cogent:
/var/log is a symlink to a directory on sdb2 (second partition on
second scsi disk), because / ran out of space the other day and I
didn't have a spare partition I could just mount as /var/log.  I did
that on July 27th, and the morning of August 5 is the first time I saw
the syslog file descriptor bug.

Comment 11 Bill Nottingham 1999-08-06 15:03:59 UTC

OK, I just played with multiple logrotate -f (force), while
building glibc, the kernel, and linuxconf, running a find both
locally and remotely, and piping them all through logger.

It still didn't do anything strange, aside from causing the machine
to run seriously low on disk space.

What kernel are you running?

Comment 12 ptomblin 1999-08-06 15:35:59 UTC

I've been running 2.2.10 for a week or so.  I'm going to try
2.2.10ac12 today.

But remember what I said earlier - this machine was running just fine
for over a week with RedHat 6.0 before it started getting this
problem.  But then again, RedHat never believed it when I reported the
EXACT same problem with RedHat 5.2, until BugTraq found and fixed a
stack overrun.

Comment 13 ptomblin 1999-08-06 16:37:59 UTC

Isn't that special?  After rebooting this morning with kernel
2.2.10ac12, I was sitting here at work logged into this alpha with a
tail -f /var/spool/messages and another window logged into it doing
something useful (reading news) when it suddenly stopped responding.
The last thing I see on my screen is

Aug  6 12:28:30 cirrus identd[9938]: Returned: 61059 , 25 : NO-USER
Aug  6 12:28:32 cirrus identd[9939]: from: 128.135.130.21 (
gsbpop.uchicago.edu ) for: 61061, 25

which freaks me out because 12:29 is when it stopped working
yesterday.  However, unlike yesterday I can't ping it or access the
web server, so I don't know what is going on.

Comment 14 Bill Nottingham 1999-08-09 19:08:59 UTC

I'm sending a patch to bugzilla (check under the e-mail link
on the page for this bug) which solves at least one problem
with klogd. This may not solve your problem though.

Comment 15 ptomblin 1999-08-09 20:01:59 UTC

I haven't seen the problem back since those two occurances on the 5th
and the 6th.  That's coincidentally when I started using 2.2.10ac12
instead of 2.2.10, but that could be coincidence.  The other hangup I
reported on the 6th was a random hang I get every few days as well,
probably hardware related.  (it's an Alpha UDB and it's showing its
age.)

Do you want me to get the source code and apply that patch, or can you
send me an RPM?

Oh, related question:  Should I be running klogd with the -s switch or
not?  I was, because I'd been told to when I was having the problem
with RedHat 5.2, but it didn't fix anything then.

Comment 16 Bill Nottingham 1999-08-09 20:30:59 UTC

That might not be a coincidence - AFAIR, Alan applied some
alpha-specific patches between -10 plain and ac12.

No, the -s option shouldn't be necessary.

If you want the RPM, I've put copies at:

ftp://people.redhat.com/notting/
and
http://people.redhat.com/notting/ftp/

(It will be in Raw Hide soon, but that might be a day or two...)

Comment 17 Bill Nottingham 1999-09-20 19:22:59 UTC

Apologies for the delay.

Has the problem repeated at all since the updated kernel?

------- Additional Comments From   09/20/99 20:13 -------
I never saw the problem again, but a few weeks ago I replaced that
Alpha machine with an Intel machine.

Comment 18 Guillaume Filion 2000-08-25 16:22:13 UTC

Sorry, but this problem does not seems to be resolved.

I've been able to reproduce it (without wanting to...) with RedHat 6.1 (i586), 
kernel-2.2.12-20 and glibc-2.1.2-11.

I was using sysklogd-1.3.31-14.i386, updating to sysklogd-1.3.31-16 did 
not solve the problem.

When syslogd starts, I get this in /var/log/messages:

Aug 24 09:53:19 sanitized syslogd 1.3-3: restart.
Aug 24 09:53:19 sanitized syslogd: select: Bad file descriptor
Aug 24 09:53:50 sanitized last message repeated 416955 times
Aug 24 09:54:51 sanitized last message repeated 860294 times
Aug 24 09:55:52 sanitized last message repeated 869058 times
Aug 24 09:56:53 sanitized last message repeated 870385 times
Aug 24 09:57:54 sanitized last message repeated 869721 times
Aug 24 09:58:55 sanitized last message repeated 869977 times
Aug 24 09:59:56 sanitized last message repeated 869792 times
Aug 24 10:00:57 sanitized last message repeated 862309 times
Aug 24 10:01:58 sanitized last message repeated 851399 times
Aug 24 10:02:59 sanitized last message repeated 862334 times
Aug 24 10:04:00 sanitized last message repeated 859975 times
Aug 24 10:05:01 sanitized last message repeated 859789 times
Aug 24 10:06:02 sanitized last message repeated 863366 times
Aug 24 10:07:03 sanitized last message repeated 868746 times
Aug 24 10:08:04 sanitized last message repeated 868313 times
Aug 24 10:09:05 sanitized last message repeated 868068 times
Aug 24 10:10:06 sanitized last message repeated 867668 times
Aug 24 10:11:07 sanitized last message repeated 870930 times
Aug 24 10:12:08 sanitized last message repeated 870589 times
Aug 24 10:13:09 sanitized last message repeated 870940 times
Aug 24 10:14:10 sanitized last message repeated 870386 times
Aug 24 10:15:11 sanitized last message repeated 871397 times
Aug 24 10:16:12 sanitized last message repeated 877305 times
Aug 24 10:17:13 sanitized last message repeated 876981 times
Aug 24 10:18:14 sanitized last message repeated 877334 times

and my CPU load peeks up at about 10 until syslogd or the kernel 
crashes.

Could someone with the right privileges reopen this bug, set platform to 
All and version to 6.1 (or tell me how to solve this bug...).

Thanks,
GFK's

Comment 19 Guillaume Filion 2000-08-25 16:41:26 UTC

Okay, I started playing with the init.d script that starts syslogd and realised 
that the command to start syslog was:
daemon syslogd -m 0 -a /home/dns/dev/log
(For those wondering, it's bastille-linux that did that, and I don't even have 
bind installed... grrr).

The folder /home/dns doesn't exits anymore, I deleted it months ago when 
I realised bastille created it. This was a mistake.

2 days ago syslogd started to do the mess described above. Changing 
the command starting syslog to:
daemon syslogd -m 0
solved the problem.

Maibe a fix for this could be that if someone adds a non existing socket, 
syslog doesn't fall into an infinite loop.

Hope this helps clarify the problem,
GFK's

Note You need to log in before you can comment on or make changes to this bug.